Overview of PAKDD Competition 2007

Author(s):  
Zhang Junping ◽  
Li Guo-Zheng

The PAKDD Competition 2007 involved the problem of predicting customers’ propensity to take up a home loan when a collection of data from credit card users is provided. The problem is rather difficult to address because 1) the data set is extremely imbalanced; 2) the features are of mixed types; and 3) there are many missing values. This article gives an overview of the competition, consisting mainly of three parts: 1) the background of the database and some statistical results of the participants are introduced; 2) an analysis of the data preparation, resampling/reweighting, and ensemble learning approaches employed by different participants is given; and 3) finally, some business insights are highlighted.
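
The combination of missing-value handling, mixed feature types, and class reweighting with an ensemble learner that the participants relied on can be illustrated with a minimal scikit-learn sketch. The file name, target column, and parameter choices below are assumptions for illustration, not details from the competition data.

```python
# Minimal sketch of the kind of pipeline described above: impute missing
# values, encode mixed-type features, reweight the rare class, and train an
# ensemble. File and column names are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("credit_card_customers.csv")        # assumed file name
y = df.pop("took_home_loan")                          # assumed target column

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

preprocess = ColumnTransformer([
    # Median imputation for numeric features with missing values.
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Most-frequent imputation plus one-hot encoding for categorical features.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("prep", preprocess),
    # Class reweighting plus an ensemble learner for the imbalanced target.
    ("clf", RandomForestClassifier(n_estimators=200, class_weight="balanced")),
])

# AUC rather than accuracy, since the classes are extremely imbalanced.
print(cross_val_score(model, df, y, scoring="roc_auc", cv=5).mean())
```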

In today's economy, the credit card (CC) plays a major role; it is an indispensable part of household, business, and global commerce. While CCs offer huge advantages when used cautiously and safely, fraudulent activity can inflict significant credit and financial damage. Several methods have been suggested to deal with rising credit card fraud (CCF). All of these strategies are meant to prevent CCF, and each has its own drawbacks, benefits, and functions. CCF has become a significant global concern because of the huge growth of e-commerce and the proliferation of online payment. Machine learning (ML) algorithms, as a data mining (DM) technology, have recently been applied extensively to CCF detection. There are, however, several challenges, including the absence of publicly available data sets, highly imbalanced class distributions, and varied, confusing fraud behavior. In this paper, we discuss the state of the art in credit card fraud detection (CCFD), its data sets, and its assessment standards, after analyzing the issues facing CCFD. The experiments use a publicly available CCFD data set. We compare the performance of two ML algorithms, Logistic Regression (LR) and XGBoost, in detecting CCF transactions in real-life data. XGBoost has an inherent ability to handle missing values: when it encounters a missing value at a node, it tries both the left and right branches of the split and learns the direction that yields the highest gain; this learned default direction is then used when the test data are run through the model. The experimental results show the effectiveness of the XGBoost classifier. Performance is reported with widely accepted metrics, accuracy and recall, and the two approaches are also compared using the ROC curve.
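
A minimal sketch of the kind of comparison described above is shown below, assuming the commonly used public credit-card-fraud CSV layout with a "Class" label column; LR needs explicit imputation of missing values, whereas XGBoost accepts NaNs directly and routes them along a learned default split direction.

```python
# Sketch comparing LR and XGBoost on an imbalanced fraud data set. The CSV
# and its "Class" column are assumptions; adjust for the data actually used.
import pandas as pd
import xgboost as xgb
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("creditcard.csv")                    # assumed file name
X, y = df.drop(columns="Class"), df["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3,
                                           random_state=0)

# Logistic regression requires explicit imputation and scaling.
lr = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(),
                   LogisticRegression(max_iter=1000, class_weight="balanced"))
lr.fit(X_tr, y_tr)

# XGBoost learns a default branch for missing values at each split, so NaNs
# can be passed in directly without imputation.
bst = xgb.XGBClassifier(n_estimators=300, eval_metric="auc",
                        scale_pos_weight=(y_tr == 0).sum() / (y_tr == 1).sum())
bst.fit(X_tr, y_tr)

for name, clf in [("LR", lr), ("XGBoost", bst)]:
    print(name, roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```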


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses some advanced techniques for dealing with missing values in a data set measuring air quality, using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing-data mechanisms are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is based on the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait, aggregated to a daily basis. A logarithm transformation was carried out for all pollutant data, in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the other imputation methods and, thus, can be considered appropriate for analyzing air quality data.
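
A minimal sketch in the spirit of the missForest procedure is shown below, using scikit-learn's IterativeImputer with random-forest regressors rather than the original R package; the file and column names are assumptions for illustration.

```python
# Iterative random-forest imputation of pollutant data, missForest-style.
# Column and file names are assumed; the log transform mirrors the paper.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

df = pd.read_csv("kuwait_air_quality_daily.csv")      # assumed file name
pollutants = ["NO2", "CO", "PM10", "SO2", "O3"]
controls = ["temp", "rh", "wind_dir", "wind_speed"]    # climatological covariates

# Log-transform pollutant concentrations to reduce skewness.
df[pollutants] = np.log1p(df[pollutants])

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, n_jobs=-1),
    max_iter=10, random_state=0)
df[pollutants + controls] = imputer.fit_transform(df[pollutants + controls])

# Back-transform to the original concentration scale.
df[pollutants] = np.expm1(df[pollutants])
```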


1997 ◽  
Vol 08 (03) ◽  
pp. 301-315 ◽  
Author(s):  
Marcel J. Nijman ◽  
Hilbert J. Kappen

A Radial Basis Boltzmann Machine (RBBM) is a specialized Boltzmann Machine architecture that combines feed-forward mapping with probability estimation in the input space, and for which very efficient learning rules exist. The hidden representation of the network displays symmetry breaking as a function of the noise in the dynamics. Thus, generalization can be studied as a function of the noise in the neuron dynamics instead of as a function of the number of hidden units. We show that the RBBM can be seen as an elegant alternative to k-nearest neighbor, offering comparable performance without the need to store all the data. We show that the RBBM has good classification performance compared to the multilayer perceptron (MLP). The main advantage of the RBBM is that, simultaneously with the input-output mapping, a model of the input space is obtained which can be used for learning with missing values. We derive learning rules for the case of incomplete data, and show that they perform better on incomplete data than the traditional learning rules applied to a 'repaired' data set.
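
The idea of exploiting a model of the input space to cope with missing values can be illustrated, not with the RBBM itself, but with a simplified stand-in: a diagonal Gaussian mixture fitted to the complete rows, whose component responsibilities, computed from the observed dimensions only, are used to fill in the missing ones. All data and parameter choices below are synthetic assumptions.

```python
# Simplified stand-in (not the RBBM): use a density model of the input space
# to fill in missing values via responsibility-weighted component means.
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(4, 1, (200, 3))])
X[rng.random(X.shape) < 0.1] = np.nan                  # knock out 10% of entries

complete = X[~np.isnan(X).any(axis=1)]
gmm = GaussianMixture(n_components=2, covariance_type="diag").fit(complete)

X_filled = X.copy()
for i in np.argwhere(np.isnan(X).any(axis=1)).ravel():
    obs = ~np.isnan(X[i])
    # Responsibility of each component given only the observed dimensions.
    log_r = np.log(gmm.weights_) + norm.logpdf(
        X[i, obs], gmm.means_[:, obs],
        np.sqrt(gmm.covariances_[:, obs])).sum(axis=1)
    r = np.exp(log_r - log_r.max())
    r /= r.sum()
    # With diagonal covariances, the conditional mean of a missing dimension
    # within a component is simply that component's mean.
    X_filled[i, ~obs] = r @ gmm.means_[:, ~obs]
```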


2021 ◽  
pp. e1-e9
Author(s):  
Elizabeth A. Erdman ◽  
Leonard D. Young ◽  
Dana L. Bernson ◽  
Cici Bauer ◽  
Kenneth Chui ◽  
...  

Objectives. To develop an imputation method to produce estimates for suppressed values within a shared government administrative data set, facilitating accurate data sharing and statistical and spatial analyses. Methods. We developed an imputation approach that incorporated known features of suppressed Massachusetts surveillance data from 2011 to 2017 to predict missing values more precisely. Our methods for 35 de-identified opioid prescription data sets combined modified previous-or-next substitution followed by mean imputation and a count adjustment to estimate suppressed values before sharing. We modeled 4 methods and compared the results to baseline mean imputation. Results. We assessed performance by comparing root mean squared error (RMSE), mean absolute error (MAE), and proportional variance between imputed and suppressed values. Our method outperformed mean imputation; we retained 46% of the suppressed values’ proportional variance with better precision (22% lower RMSE and 26% lower MAE) than simple mean imputation. Conclusions. Our easy-to-implement imputation technique largely overcomes the adverse effects of low-count value suppression, with superior results to simple mean imputation. This novel method is generalizable to researchers sharing protected public health surveillance data. (Am J Public Health. Published online ahead of print September 16, 2021: e1–e9. https://doi.org/10.2105/AJPH.2021.306432 )
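
A rough sketch of the general flavor of such an approach (not the authors' exact algorithm) is given below: suppressed cells are filled by previous-or-next substitution along each series, by a mean fallback where that fails, and finally adjusted so each imputed count stays below the suppression threshold. The threshold and the example series are assumptions.

```python
# Rough sketch only: suppressed low counts (stored as NaN) are filled by
# previous/next substitution, then a mean fallback, then a count adjustment.
import numpy as np
import pandas as pd

SUPPRESSION_THRESHOLD = 11          # assumed: counts of 1-10 are suppressed

def impute_suppressed(counts: pd.Series) -> pd.Series:
    """Impute NaN (suppressed) entries of one time series of counts."""
    filled = counts.copy()
    # Previous-or-next substitution: carry the nearest reported value into gaps.
    filled = filled.ffill().bfill()
    # Mean imputation as a fallback for entirely suppressed series.
    filled = filled.fillna(filled.mean())
    # Count adjustment: imputed cells must remain plausible low counts.
    mask = counts.isna()
    filled[mask] = filled[mask].clip(1, SUPPRESSION_THRESHOLD - 1).round()
    return filled

# Example: quarterly counts for one town, with two suppressed cells.
series = pd.Series([34, np.nan, 15, np.nan, 52, 41])
print(impute_suppressed(series).tolist())
```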


Author(s):  
Soukaena Hassan Hashem

This chapter aims to build a proposed Wire/Wireless Network Intrusion Detection System (WWNIDS) to detect intrusions, taking into account many modern attacks that were not considered previously. The proposed WWNIDS treats intrusion detection using only the intrinsic features rather than all of them. The WWNIDS dataset consists of two parts. The first part is a wired-network dataset constructed from KDD'99, which has 41 features; some modifications, including three suggested additional features, produce the proposed dataset, called Modern KDD, to make intrusion detection more reliable. The second part is a wireless-network dataset built by collecting thousands of sessions (normal and intrusion); this proposed dataset is called the Constructed Wireless Data Set (CWDS). A preprocessing step is applied to both datasets (KDD & CWDS) to eliminate problems that affect intrusion detection, such as noise, missing values, and duplication.
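
A minimal sketch of the preprocessing step described above is shown below for a KDD'99-style CSV; the file names and the choice of median/mode filling and percentile clipping are illustrative assumptions, and the three proposed additional features are not reproduced here.

```python
# Sketch of the preprocessing step: remove duplicates, fill missing values,
# and clip noisy numeric values. File and column layouts are assumed.
import pandas as pd

def preprocess(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Remove duplicated sessions, which bias detection toward frequent records.
    df = df.drop_duplicates()

    # Fill missing values: medians for numeric features, modes for symbolic ones.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Simple noise handling: clip numeric features to reasonable percentiles.
    for col in df.select_dtypes(include="number").columns:
        lo, hi = df[col].quantile([0.001, 0.999])
        df[col] = df[col].clip(lo, hi)
    return df

wire = preprocess("modern_kdd.csv")      # assumed file name
wireless = preprocess("cwds.csv")        # assumed file name
```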


Author(s):  
Fabrizio Angiulli

Data mining techniques can be grouped into four main categories: clustering, classification, dependency detection, and outlier detection. Clustering is the process of partitioning a set of objects into homogeneous groups, or clusters. Classification is the task of assigning objects to one of several predefined categories. Dependency detection searches for pairs of attribute sets which exhibit some degree of correlation in the data set at hand. The outlier detection task can be defined as follows: “Given a set of data points or objects, find the objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data”. These exceptional objects are also referred to as outliers. Most of the early methods for outlier identification were developed in the field of statistics (Hawkins, 1980; Barnett & Lewis, 1994). Hawkins’ definition of outlier clarifies the approach: “An outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”. Indeed, statistical techniques assume that the given data set follows a distribution model. Outliers are those points that satisfy a discordancy test, that is, that are significantly far from their expected position under the hypothesized distribution. Many clustering, classification and dependency detection methods produce outliers as a by-product of their main task. For example, in classification, mislabeled objects are considered outliers and are removed from the training set to improve the accuracy of the resulting classifier, while in clustering, objects that do not strongly belong to any cluster are considered outliers. Nevertheless, searching for outliers with techniques designed for tasks other than outlier detection may not be advantageous. As an example, clusters can be distorted by outliers, and thus the quality of the outliers returned is affected by their presence. Moreover, besides returning a solution of higher quality, outlier detection algorithms can be vastly more efficient than non-ad-hoc algorithms. While in many contexts outliers are considered noise that must be eliminated, as pointed out elsewhere, “one person’s noise could be another person’s signal”, and thus outliers themselves can be of great interest. Outlier mining is used in telecom and credit card fraud detection to spot atypical usage of telecom services or credit cards, in intrusion detection to detect unauthorized accesses, in medical analysis to test abnormal reactions to new medical therapies, in marketing and customer segmentation to identify customers who spend much more or much less than the average customer, in surveillance systems, in data cleaning, and in many other fields.
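
The statistical flavor of a discordancy test can be illustrated with a short sketch: hypothesize a Gaussian model for the data and flag the points whose Mahalanobis distance from the fitted distribution exceeds a chi-square cutoff. The data and the 99.9% threshold below are illustrative assumptions.

```python
# Discordancy-test sketch: fit a multivariate Gaussian and flag points that
# lie significantly far from their expected position under that model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
data = np.vstack([data, [[6.0, -5.0], [7.5, 8.0]]])    # two injected outliers

mean = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

# Squared Mahalanobis distance of each point from the fitted distribution.
diff = data - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Under the Gaussian model, d2 follows a chi-square distribution with as many
# degrees of freedom as dimensions; flag points beyond the 99.9th percentile.
threshold = stats.chi2.ppf(0.999, df=data.shape[1])
print(np.where(d2 > threshold)[0])
```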

