Data Level Approach for Multiclass Imbalance Financial Data

2020 · Vol 19

The class imbalance problem is a common real-world issue in which a classifier gives more importance to the majority class and less to the minority class. Under class imbalance, metrics such as error rate or predictive accuracy are not suitable for evaluating classifier performance. One family of methods for handling imbalanced data is resampling. In this paper, three resampling methods (oversampling, under-sampling and hybrid) are applied with different approaches to two imbalanced financial datasets to see the impact of class imbalance ratios on the performance measures of nine classification algorithms. Aiming at better classification performance, the algorithms (Bayes Net, Naive Bayes, J48, Random Forest, Meta-Attribute Selected Classifier, Meta-Classification via Regression, Meta-LogitBoost, Logistic Regression, and Decision Tree) are measured on multiclass imbalanced data from two Canadian banks using Precision, Recall, ROC Area and the Kappa Statistic in the WEKA software. The outcomes of these performance measurements are compared across the three resampling methods. The results give a clear picture of the overall impact of class imbalance on the classification datasets and indicate that the proposed resampling methods can also be used for class imbalance problems.
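
The paper runs its experiments in WEKA; as an illustration of the three resampling strategies it compares, the sketch below uses Python with scikit-learn and imbalanced-learn (an assumed toolchain, not the authors') on a synthetic stand-in for a multiclass imbalanced financial dataset.

```python
# Sketch of the three resampling strategies compared in the paper:
# oversampling, undersampling, and a hybrid method. The paper itself
# uses WEKA; imbalanced-learn is assumed here purely for illustration.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

# Synthetic stand-in for a multiclass imbalanced financial dataset.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.85, 0.10, 0.05], random_state=42)
print("original:", Counter(y))

for name, sampler in [("oversampling", RandomOverSampler(random_state=42)),
                      ("undersampling", RandomUnderSampler(random_state=42)),
                      ("hybrid (SMOTE+Tomek)", SMOTETomek(random_state=42))]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```

RandomOverSampler and RandomUnderSampler rebalance by duplicating or discarding instances, while SMOTETomek is one common hybrid that combines synthetic oversampling with Tomek-link cleaning.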

2020 · pp. 096228022098048
Author(s): Olga Lyashevska, Fiona Malone, Eugene MacCarthy, Jens Fiehler, Jan-Hendrik Buhk, ...

Imbalance between positive and negative outcomes, a so-called class imbalance, is a problem generally found in medical data. Imbalanced data hinder the performance of conventional classification methods, which aim to improve the overall accuracy of the model without accounting for the uneven distribution of the classes. To rectify this, the data can be resampled by oversampling the positive (minority) class until the classes are approximately equally represented. After that, a prediction model such as a gradient boosting algorithm can be fitted with greater confidence. This classification method allows for non-linear relationships and deep interaction effects while focusing on difficult areas by iteratively shifting towards problematic observations. In this study, we demonstrate the application of these methods to medical data and develop a practical framework for evaluating the features that contribute to the probability of stroke.
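
As a minimal sketch of the described pipeline, assuming scikit-learn and imbalanced-learn with synthetic data in place of the study's stroke records: oversample the positive class on the training split only, then fit gradient boosting.

```python
# Sketch: oversample the minority (positive) class, then fit gradient
# boosting. Library choices (scikit-learn / imbalanced-learn) are
# assumptions; the study's actual features and data are not reproduced.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Resample only the training split so the test set keeps its natural imbalance.
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

gbm = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
print("AUC:", roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1]))
```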


Author(s): Sreeja N. K.

Learning a classifier from imbalanced data is one of the most challenging research problems. Data imbalance occurs when the number of instances belonging to one class is much smaller than the number of instances belonging to the other class. A standard classifier is biased towards the majority class and therefore misclassifies minority class instances. Minority class instances may be regarded as rare events or unusual patterns that could potentially have a negative impact on society; therefore, detection of such events is considered significant. This chapter proposes a FireWorks-based Hybrid ReSampling (FWHRS) algorithm to resample imbalanced data. It is used with a Weighted Pattern Matching based classifier (PMC+) for classification. FWHRS-PMC+ was evaluated on 44 imbalanced binary datasets. Experiments reveal that FWHRS-PMC+ is effective in the classification of imbalanced data. Empirical results were validated using non-parametric statistical tests.
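
FWHRS and PMC+ are the authors' own algorithms and are not reproduced here; the sketch below only illustrates the kind of non-parametric validation mentioned, a Wilcoxon signed-rank test over paired per-dataset scores, with placeholder numbers rather than the chapter's results.

```python
# Sketch of non-parametric validation of paired classifier results,
# e.g. per-dataset scores of two methods across imbalanced datasets.
# The scores below are illustrative placeholders, not the chapter's data.
from scipy.stats import wilcoxon

scores_fwhrs_pmc = [0.81, 0.77, 0.92, 0.68, 0.85, 0.74, 0.90, 0.79]
scores_baseline  = [0.76, 0.71, 0.90, 0.61, 0.80, 0.75, 0.84, 0.72]

stat, p = wilcoxon(scores_fwhrs_pmc, scores_baseline)
print(f"Wilcoxon W={stat:.1f}, p={p:.4f}")  # small p -> significant difference
```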


Author(s): Dariusz Brzezinski, Leandro L. Minku, Tomasz Pewinski, Jerzy Stefanowski, Artur Szumaczuk

Abstract Class imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and the presence of unsafe types of examples. Despite often being present in real-world data, the interactions between concept drifts and local data difficulty factors have not yet been investigated in concept drifting data streams. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations on the predictions of representative online classifiers. Experimental results reveal the high influence of the newly considered factors and their local drifts, as well as differences in existing classifiers' reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address the challenges posed by imbalanced data streams.
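
As a minimal, assumption-laden sketch of prequential (test-then-train) evaluation on an imbalanced stream with an abrupt local drift in the minority class: the generator and the SGD classifier below are illustrative choices, not the study's setup.

```python
# Sketch: prequential (test-then-train) evaluation on a synthetic
# imbalanced stream whose minority cluster jumps at t=2500, i.e. a
# local drift affecting only the minority class. Illustrative only.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
clf = SGDClassifier(loss="log_loss", random_state=1)
tp = fn = 0

for t in range(5000):
    y = int(rng.random() < 0.05)              # ~5% minority class
    centre = 3.0 if (y and t < 2500) else (-3.0 if y else 0.0)
    x = rng.normal(centre, 1.0, size=(1, 2))  # minority cluster moves at t=2500

    if t:                                     # test first ...
        pred = int(clf.predict(x)[0])
        if y == 1:
            tp += pred                        # minority hit
            fn += 1 - pred                    # minority miss
    clf.partial_fit(x, [y], classes=[0, 1])   # ... then train

# Accuracy would hide the minority class, so report minority recall instead.
print("prequential minority recall:", tp / max(tp + fn, 1))
```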


2020 · Vol 19 (01) · pp. 2040014
Author(s): Neda Abdelhamid, Arun Padmavathy, David Peebles, Fadi Thabtah, Daymond Goulder-Horobin

Machine learning (ML) is a branch of computer science that is rapidly gaining popularity within the healthcare arena due to its ability to explore large datasets and discover useful patterns that can be interpreted for decision-making and prediction. ML techniques are used for the analysis of clinical parameters and their combinations for prognosis, therapy planning and support, and patient management and wellbeing. In this research, we investigate a crucial problem associated with medical applications such as autism spectrum disorder (ASD): data imbalance, in which instances of one class far outnumber those of the other. In autism diagnosis data, the number of instances linked with the no-ASD class is larger than that of the ASD class, and this may cause performance issues such as models favouring the majority class and undermining the minority class. This research experimentally measures the impact of the class imbalance issue on the performance of different classifiers on real autism datasets when various data imbalance approaches are utilised in the pre-processing phase. We employ oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE), and undersampling with different classifiers including Naive Bayes, RIPPER, C4.5 and Random Forest to measure their impact on the performance of the derived models in terms of area under the curve and other metrics. The results indicate that oversampling techniques are superior to undersampling techniques, at least for the toddlers' autism dataset that we consider, and suggest that further work should look at incorporating sampling techniques with feature selection to generate models that do not overfit the dataset.
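
A hedged sketch of this experimental design follows, with synthetic data standing in for the autism screening datasets and scikit-learn stand-ins for RIPPER and C4.5 (which have no direct scikit-learn equivalents).

```python
# Sketch: compare SMOTE oversampling against simple undersampling for
# several classifiers, scored by AUC. Synthetic data stands in for the
# autism screening datasets; library choices are assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=7)

classifiers = {"NaiveBayes": GaussianNB(),
               "C4.5-like tree": DecisionTreeClassifier(random_state=7),
               "RandomForest": RandomForestClassifier(random_state=7)}
samplers = {"SMOTE": SMOTE(random_state=7),
            "undersampling": RandomUnderSampler(random_state=7)}

for cname, clf in classifiers.items():
    for sname, sampler in samplers.items():
        pipe = Pipeline([("resample", sampler), ("clf", clf)])
        auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{cname:15s} + {sname:13s} AUC = {auc:.3f}")
```

Note that imbalanced-learn's Pipeline applies resampling only inside the training folds of cross-validation, which avoids leaking resampled points into the test folds.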


Author(s): Saman Riaz, Ali Arshad, Licheng Jiao

Software fault prediction is a highly consequential research topic for software quality assurance. Data-driven approaches provide robust mechanisms to deal with software fault prediction; however, the prediction performance of a model depends strongly on the quality of the dataset. Many software datasets suffer from the problem of class imbalance. In this regard, under-sampling is a popular data pre-processing method for dealing with the class imbalance problem, and Easy Ensemble (EE) presents a robust approach to achieve a high classification rate and address the bias towards majority class samples. However, class imbalance is not the only issue that harms the performance of classifiers: noisy examples and irrelevant features may additionally reduce the predictive accuracy of the classifier. In this paper, we propose a two-stage data pre-processing approach which incorporates feature selection and a new Rough set Easy Ensemble scheme. In the feature selection stage, we eliminate irrelevant features with a feature ranking algorithm. In the second stage, we apply a Rough K-nearest-neighbor rule filter (RK) before executing Easy Ensemble (EE), a combination named RKEE for short. RK can remove noisy examples from both the minority and the majority class. Experimental evaluation on real-world software projects, such as the NASA and Eclipse datasets, is performed to demonstrate the effectiveness of our proposed approach. Furthermore, this paper comprehensively investigates the influencing factors in our approach, such as the impact of Rough set theory on the noise filter and the relationship between model performance and imbalance ratio. Comprehensive experiments indicate that the proposed approach shows outstanding performance with significance in terms of area under the curve (AUC).
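
The RK filter is the authors' contribution and has no public implementation; in the sketch below, assuming scikit-learn and imbalanced-learn, an edited-nearest-neighbours filter stands in for it to illustrate the two-stage idea of feature ranking followed by noise filtering and Easy Ensemble.

```python
# Sketch of the two-stage idea: (1) drop low-ranked features, (2) filter
# noisy examples, then train Easy Ensemble. ENN stands in for the authors'
# RK filter; unlike RK, ENN's default cleans only majority-class examples.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.ensemble import EasyEnsembleClassifier

X, y = make_classification(n_samples=1200, n_features=30, n_informative=8,
                           weights=[0.92, 0.08], flip_y=0.05, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

# Stage 1: feature ranking -- keep the 10 highest-scoring features.
selector = SelectKBest(mutual_info_classif, k=10).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# Stage 2: noise filtering (ENN here, in place of RK), then Easy Ensemble.
X_clean, y_clean = EditedNearestNeighbours().fit_resample(X_tr_sel, y_tr)
model = EasyEnsembleClassifier(random_state=3).fit(X_clean, y_clean)

print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te_sel)[:, 1]))
```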


2017 · Vol 42 (2) · pp. 149-176
Author(s): Szymon Wojciechowski, Szymon Wilk

Abstract In this paper we describe the results of an experimental study in which we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers, applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically vary factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the latter factor was the most critical one and that it exacerbated other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, particularly by k-NN classifiers (with 1 or 3 neighbors, denoted 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN.
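
The safe/borderline/rare/outlier categorization is commonly derived from the class makeup of each minority example's five nearest neighbours; below is a sketch of that heuristic, assuming scikit-learn (the paper's own data generator is not reproduced).

```python
# Sketch: label minority examples as safe / borderline / rare / outlier
# from the class makeup of their 5 nearest neighbours -- the common
# heuristic behind the difficulty factors studied in the paper.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           flip_y=0.03, random_state=5)
nn = NearestNeighbors(n_neighbors=6).fit(X)        # 6 = self + 5 neighbours

labels = []
for i in np.where(y == 1)[0]:
    idx = nn.kneighbors(X[i:i + 1], return_distance=False)[0][1:]
    same = int((y[idx] == 1).sum())                # minority neighbours of 5
    labels.append("safe" if same >= 4 else
                  "borderline" if same >= 2 else
                  "rare" if same == 1 else "outlier")

print(Counter(labels))
```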


2020 · Vol 30 (Supplement_5)
Author(s): M Poldrugovac, J E Amuah, H Wei-Randall, P Sidhom, K Morris, ...

Abstract Background Evidence of the impact of public reporting of healthcare performance on quality improvement is not yet sufficient to draw conclusions with certainty, despite the important policy implications. This study explored the impact of implementing public reporting of performance indicators of long-term care facilities in Canada. The objective was to analyse whether improvements can be observed in performance measures after publication. Methods We considered 16 performance indicators in long-term care in Canada, 8 of which are publicly reported at a facility level, while the other 8 are privately reported. We analysed data from the Continuing Care Reporting System managed by the Canadian Institute for Health Information, based on information collected with RAI-MDS 2.0© between the fiscal years 2011 and 2018. A multilevel model was developed to analyse time trends before and after publication, which started in 2015. The analysis was also stratified by key sample characteristics, such as the facilities' jurisdiction, size, urban or rural location, and performance prior to publication. Results Data from 1087 long-term care facilities were included. Among the 8 publicly reported indicators, the trend in the period after publication did not change significantly in 5 cases, improved in 2 cases and worsened in 1 case. Among the 8 privately reported indicators, no change was observed in 7 and worsening was observed in 1. The stratification of the data suggests that for those indicators that were already improving prior to public reporting, there was either no change in trend or a decrease in the rate of improvement after publication. For those indicators that showed a worsening trend prior to public reporting, the contrary was observed. Conclusions Our findings suggest that public reporting of performance data can support change. The trends of performance indicators prior to publication appear to have an impact on whether further change occurs after publication. Key messages Public reporting is likely one of the factors affecting change in performance in long-term care facilities. Public reporting of performance measures in long-term care facilities may support improvements, in particular in cases where improvement was not observed before publication.


Author(s): Grant Duwe

As the use of risk assessments for correctional populations has grown, so has concern that these instruments exacerbate existing racial and ethnic disparities. While much of the attention arising from this concern has focused on how algorithms are designed, relatively little consideration has been given to how risk assessments are used. To this end, the present study tests whether application of the risk principle would help preserve predictive accuracy while, at the same time, mitigating disparities. Using a sample of 9,529 inmates released from Minnesota prisons who had been assessed multiple times during their confinement on a fully automated risk assessment, this study relies on both actual and simulated data to examine the impact of program assignment decisions on changes in risk level from intake to release. The findings showed that while the risk principle was used in practice to some extent, greater adherence to the risk principle would increase reductions in risk levels and minimize the disparities observed at intake. The simulated data further revealed that the most favorable outcomes would be achieved not only by applying the risk principle, but also by expanding program capacity for higher-risk inmates in order to adequately reduce their risk.


Energies · 2021 · Vol 14 (5) · pp. 1253
Author(s): Maja Piesiewicz, Marlena Ciechan-Kujawa, Paweł Kufel

Integrated reports combine financial and non-financial data into a comprehensive report outlining the company's value creation process. Our objective is to assess the completeness of disclosures, a crucial aspect of an integrated report's quality. This study contributes to the examination of integrated reporting by identifying quantitative and qualitative gaps in the application of Integrated Reporting standards, focusing on the energy sector. We conducted the study on 57 published integrated reports of listed companies in Poland. The content of each report was examined for 49 features divided into eight areas. We identify the strengths and weaknesses of current reporting practice and the impact of the company's sector on report quality. We noted that there are significant differences among the areas. The major problems concern implementing the IIRC's framework on the connections between the business model and the organization's strategy, risks, opportunities, and performance. We also noted that the level of specific disclosures may be related to a company's ownership structure. We investigated the significance of differences between companies from the energy and non-energy sectors using statistical methods. The study shows that the completeness of disclosures depends on the sector of operation: companies in the energy sector publish higher-quality integrated reports than companies in other sectors.

