Improving Machine Learning Prediction of Peatlands Fire Occurrence for Unbalanced Data Using SMOTE Approach

Accurately predicting the occurrence of a fire is very difficult, yet it is essential for protecting human life and property. We therefore study fire hazard prediction and evaluation methods for coping with fire risk. In this paper, we propose three models for fire risk assessment based on statistical machine learning and optimized risk indexing: logistic regression, deep neural network (DNN), and fire risk indexing models. We compare the performance of the proposed models against traditional models using real data collected on fire occurrence in Korea. Fire prediction models currently in use generally do not provide satisfactory accuracy, because the factors affecting fire occurrence are very diverse and fires occur only sparsely. To improve the accuracy of fire occurrence prediction, we first build logistic regression and DNN models, and then construct a fire risk indexing model to improve prediction further. The comparison between our models and the current fire prediction model uses real fire data collected in Korea from 2011 to 2017. The experimental results confirm that the proposed method predicts fire occurrence more accurately than the existing model. We therefore expect the proposed model to help evaluate fire risk in buildings and factories for fire insurance and to support the calculation of fire insurance premiums.

Robust Semi-Supervised Traffic Sign Recognition via Self-Training and Weakly-Supervised Learning

Sensors ◽ 10.3390/s20092684 ◽ 2020 ◽ Vol 20 (9) ◽ pp. 2684 ◽ Cited By ~ 3

Author(s): Obed Tettey Nartey ◽ Guowu Yang ◽ Sarpong Kwadwo Asare ◽ Jinzhao Wu ◽ Lady Nadia Frempong

Keyword(s): Machine Learning ◽ Computer Vision ◽ Supervised Learning ◽ Machine Learning Algorithms ◽ Unbalanced Data ◽ Traffic Sign Recognition ◽ Training Set ◽ Traffic Sign ◽ Sign Recognition ◽ Weakly Supervised

Traffic sign recognition is a classification problem that poses challenges for computer vision and machine learning algorithms. Although both computer vision and machine learning techniques have steadily improved on this problem, the sudden rise in the number of unlabeled traffic signs has made it even more challenging. Collecting and labeling large datasets is tedious and expensive, demanding much time, expert knowledge, and financial resources to satisfy the data hunger of deep neural networks. In addition, unbalanced data makes it harder still for computer vision and machine learning algorithms to achieve good performance. These problems raise the need for algorithms that can fully exploit a large amount of unlabeled data, use a small number of labeled samples, and remain robust to data imbalance in order to build an efficient, high-quality classifier. In this work, we propose a novel semi-supervised classification technique that is robust to small and unbalanced data. The framework integrates weakly-supervised learning and self-training with self-paced learning to generate attention maps that augment the training set, and it uses a novel pseudo-label generation and selection algorithm to generate and select pseudo-labeled samples. The method improves performance by: (1) normalizing the class-wise confidence levels to prevent the model from ignoring hard-to-learn samples, thereby addressing the imbalanced data problem; (2) jointly learning a model and optimizing pseudo-labels generated on unlabeled data; and (3) enlarging the training set to satisfy the data hunger of deep learning models. Extensive evaluations on two public traffic sign recognition datasets demonstrate the effectiveness of the proposed technique and suggest a potential solution for practical applications.
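
The class-wise confidence normalization in point (1) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `select_pseudo_labels`, the `threshold` parameter, and the rule of dividing each confidence by the highest confidence seen for its predicted class are all assumptions made for illustration.

```python
def select_pseudo_labels(probs, threshold=0.9):
    """Select pseudo-labels using class-normalized confidence (illustrative only).

    probs: list of per-sample class-probability lists predicted on unlabeled data.
    Returns a list of (sample_index, predicted_class) pairs.
    """
    # Predicted class and raw confidence for every unlabeled sample.
    preds = []
    for i, p in enumerate(probs):
        cls = max(range(len(p)), key=p.__getitem__)
        preds.append((i, cls, p[cls]))

    # Highest confidence observed for each predicted class.
    class_max = {}
    for _, cls, conf in preds:
        class_max[cls] = max(class_max.get(cls, 0.0), conf)

    # Assumed rule: compare each confidence with its own class maximum, so a
    # class the model is systematically unsure about is not filtered out.
    return [(i, cls) for i, cls, conf in preds
            if conf / class_max[cls] >= threshold]


# Hypothetical predictions: class 0 is "easy", class 1 is "hard".
example_probs = [[0.95, 0.05], [0.90, 0.10], [0.45, 0.55],
                 [0.42, 0.58], [0.70, 0.30]]
print(select_pseudo_labels(example_probs))
```

With these hypothetical probabilities, a raw 0.9 cut-off would keep only the two confident class-0 samples, whereas the normalized rule also keeps the two class-1 samples and still rejects the weak class-0 prediction.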

Infection status outcome, machine learning method and virus type interact to affect the optimised prediction of hepatitis virus immunoassay results from routine pathology laboratory assays in unbalanced data

BMC Bioinformatics ◽ 10.1186/1471-2105-14-206 ◽ 2013 ◽ Vol 14 (1) ◽ Cited By ~ 10

Author(s): Alice M Richardson ◽ Brett A Lidbury

Keyword(s): Machine Learning ◽ Virus Type ◽ Hepatitis Virus ◽ Unbalanced Data ◽ Machine Learning Method ◽ Learning Method ◽ Infection Status ◽ Pathology Laboratory

Intelligent methods for improving the accuracy of prediction of rare hazardous events in railway transportation

Dependability ◽ 10.21683/1729-2646-2021-21-3-54-65 ◽ 2021 ◽ Vol 21 (3) ◽ pp. 54-64

Author(s): O. B. Pronevich ◽ M. V. Zaitsev

Keyword(s): Machine Learning ◽ Initial Data ◽ Power Supply ◽ Rare Event ◽ Unbalanced Data ◽ Classification Models ◽ Railway Transportation ◽ Event Classification ◽ Hazardous Event

The paper aims to examine approaches to improving the quality of prediction and classification on unbalanced data, and thereby the accuracy of rare event classification. When predicting the onset of rare events using machine learning techniques, researchers face a mismatch between the measured quality of trained models and their actual ability to correctly predict the occurrence of a rare event. The paper examines model training under unbalanced initial data. The subject of research is information on incidents and hazardous events at railway power supply facilities. The problem of unbalanced data shows up as a noticeable imbalance between the types of observed events, i.e., in the numbers of instances. Methods. When handling unbalanced data, various data science techniques are used to improve the quality of classification models and prediction, depending on the nature of the problem at hand and on the quality and size of the initial data. Some of those methods focus on the attributes and parameters of classification models. They include FAST, CFS, fuzzy classifiers, GridSearchCV, etc. Another group of methods is oriented towards generating representative subsets, i.e., samples, out of the initial datasets. Data sampling techniques allow examining the effect of class proportions on the quality of machine learning. In particular, this paper considers the NearMiss method in detail. Results. The problem of class imbalance in the analysis of the number of incidents at railway facilities has existed since 2015. Despite the decreasing share of hazardous events at railway power supply facilities in the three years since 2018, an increase in the number of such events cannot be ruled out. Monthly statistics of hazardous event distribution exhibit no trend in declines and peaks. In this context, the optimal period of observation of the number of incidents and hazardous events is a month.
A visualization of the class ratio has shown the absence of a clear boundary between the members of the majority class (incidents) and those of the minority class (hazardous events). The class ratio was studied in two and three dimensions, in actual values and using principal component analysis. Such “proximity” of classes is one of the causes of wrong predictions. In this paper, the authors analysed past research on ways of improving the quality of machine learning based on unbalanced data. The terms that describe the degree of class imbalance have been defined and clarified. The strengths and weaknesses of 50 methods of handling such data were studied and set forth. From the set of methods that adjust the numbers of class members, the NearMiss method was chosen for the classification (prediction of the occurrence) of rare hazardous events in railway transportation. It allows experimenting with the ratios and methods of selecting class members. As a result of a series of experiments, the accuracy of rare hazardous event classification was improved from 0 to 70-90%.
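
The idea behind NearMiss undersampling can be sketched as follows. This is a minimal illustration of the NearMiss-1 variant (keep the majority points whose mean distance to their nearest minority points is smallest), not the authors' implementation; the function name `near_miss_1`, its parameters, and the toy coordinates are hypothetical.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Undersample the majority class in the style of NearMiss-1 (sketch).

    Keeps the n_keep majority points with the smallest mean Euclidean
    distance to their k nearest minority points.
    """
    def mean_dist_to_nearest_minority(point):
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=mean_dist_to_nearest_minority)
    return ranked[:n_keep]


# Toy data: four "incident" points (majority), two "hazardous event" points.
incidents = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
hazardous = [(0.0, 1.0), (1.0, 1.0)]

# n_keep controls the resulting class ratio; here it yields a 1:1 ratio.
print(near_miss_1(incidents, hazardous, n_keep=2, k=2))
```

Varying `n_keep` is one way to experiment with class ratios, and swapping the scoring function is one way to experiment with how class members are selected.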

Detection of fraudulent credit card transactions: A comparative analysis of data sampling and classification techniques

Journal of Physics Conference Series ◽ 10.1088/1742-6596/2161/1/012072 ◽ 2022 ◽ Vol 2161 (1) ◽ pp. 012072

Author(s): Konduri Praveen Mahesh ◽ Shaik Ashar Afrouz ◽ Anu Shaju Areeckal

Keyword(s): Machine Learning ◽ Credit Card ◽ Research Problem ◽ Machine Learning Algorithms ◽ Support Vector ◽ Unbalanced Data ◽ Learning Approaches ◽ Data Sampling ◽ Sampled Data ◽ Under Sampling

Every year, an increasing amount of money is lost to fraudulent credit card transactions. Recently, there has been a focus on using machine learning algorithms to identify fraudulent transactions. The ratio of fraud cases to non-fraud transactions is very low. This produces skewed, or unbalanced, data, which poses a challenge for training machine learning models. Public datasets for this research problem are scarce. The dataset used for this work was obtained from Kaggle. In this paper, we explore different sampling techniques, such as under-sampling, Synthetic Minority Oversampling Technique (SMOTE) and SMOTE-Tomek, to work on the unbalanced data. Classification models, such as k-Nearest Neighbour (KNN), logistic regression, random forest and Support Vector Machine (SVM), are trained on the sampled data to detect fraudulent credit card transactions. The performance of the various machine learning approaches is evaluated in terms of precision, recall and F1-score. The classification results obtained are promising and can be used for credit card fraud detection.
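
The core step of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in plain Python. This is a simplified illustration, not the pipeline used in the paper; the function name `smote`, its parameters, and the toy points are hypothetical, and it assumes at least two minority samples.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by SMOTE-style interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the base.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Its k nearest minority neighbours, excluding the base itself.
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        # New point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class (e.g. fraud cases) in a two-feature space.
fraud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(smote(fraud, n_new=4, k=2))
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the region already occupied by the minority class. Oversampling of this kind is normally applied to the training split only, so that evaluation still reflects the real class ratio.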

Investigation of Machine Learning Models and Different Feature Sets for the Efficiency of Early Sepsis Prediction from Highly Unbalanced Data

10.20944/preprints202005.0205.v1 ◽ 2020

Author(s): Vytautas Abromavičius ◽ Darius Plonis ◽ Deividas Tarasevičius ◽ Artūras Serackis

Keyword(s): Machine Learning ◽ Intensive Care Unit ◽ Intensive Care ◽ Early Detection ◽ Performance Metrics ◽ Unbalanced Data ◽ Clinical Criteria ◽ Unbalanced Dataset ◽ Model Training ◽ Clinical Records

The presented research addresses the problem of early detection of sepsis in Intensive Care Unit patients. The PhysioNet/Computing in Cardiology Challenge 2019 facilitated the development of automated, open-source algorithms for the early detection of sepsis from clinical data. The challenge organizers provided a labeled dataset of clinical records for training and verifying the algorithms. However, the relatively small number of records with sepsis, as defined by the Sepsis-3 clinical criteria, led to a highly unbalanced dataset (only 2% of records carry a sepsis label). Such heavily unbalanced data is a great challenge for machine learning model training and is not suitable for training classical classifiers. To address these issues, a number of models were investigated. This paper proposes a solution that includes feature selection and data balancing techniques. In addition, several performance metrics were investigated. Results show that, for successful prediction, a particular model, with fewer or more predictors depending on the length of stay in the Intensive Care Unit, should be applied.
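
A small sketch shows why the choice of performance metric matters at a 2% positive rate. It is illustrative only and does not reproduce the metrics used in the paper; the function names and the synthetic labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Synthetic labels with 2% positives, mirroring the stated class ratio.
y_true = [1] * 2 + [0] * 98
# A trivial classifier that never predicts sepsis.
always_negative = [0] * 100

print(accuracy(y_true, always_negative))
print(precision_recall_f1(y_true, always_negative))
```

The trivial classifier scores 98% accuracy while its recall and F1 for the sepsis class are zero, which is why accuracy alone says little about performance on data this unbalanced.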
