Overview of PAKDD Competition 2007

Author(s):  
Zhang Junping ◽  
Li Guo-Zheng

The PAKDD Competition 2007 involved the problem of predicting customers’ propensity to take up a home loan when a collection of data from credit card users is provided. The problem is rather difficult to address because 1) the data set is extremely imbalanced; 2) the features are of mixed types; and 3) there are many missing values. This article gives an overview of the competition, consisting mainly of three parts: 1) the background of the database and some statistical results of the participants are introduced; 2) an analysis of the data preparation, resampling/reweighting, and ensemble learning approaches employed by different participants is given; and 3) finally, some business insights are highlighted.
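
The combination of missing-value handling, mixed feature types, and class reweighting with an ensemble learner that the participants relied on can be illustrated with a minimal scikit-learn sketch. The file name, target column, and parameter choices below are assumptions for illustration, not details from the competition data.

```python
# Minimal sketch of the kind of pipeline described above: impute missing
# values, encode mixed-type features, reweight the rare class, and train an
# ensemble. File and column names are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("credit_card_customers.csv")        # assumed file name
y = df.pop("took_home_loan")                          # assumed target column

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

preprocess = ColumnTransformer([
    # Median imputation for numeric features with missing values.
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Most-frequent imputation plus one-hot encoding for categorical features.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("prep", preprocess),
    # Class reweighting plus an ensemble learner for the imbalanced target.
    ("clf", RandomForestClassifier(n_estimators=200, class_weight="balanced")),
])

# AUC rather than accuracy, since the classes are extremely imbalanced.
print(cross_val_score(model, df, y, scoring="roc_auc", cv=5).mean())
```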

In today's economy, the credit card (CC) plays a major role; it is an indispensable part of household, business, and global commerce. While CCs offer huge advantages when used cautiously and safely, fraudulent activity can inflict significant credit and financial damage. Several methods have been suggested to deal with rising credit card fraud (CCF). All of these strategies are meant to prevent CCF, and each has its own drawbacks, benefits, and functions. CCF has become a significant global concern because of the huge growth of e-commerce and the proliferation of online payment. Machine learning (ML) algorithms, as a data mining (DM) technology, have recently been applied extensively to CCF detection. There are, however, several challenges, including the absence of publicly available data sets, highly imbalanced class distributions, and varied, confusing fraud behavior. In this paper, we discuss the state of the art in credit card fraud detection (CCFD), its data sets, and its assessment standards, after analyzing the issues facing CCFD. The experiments use a publicly available CCFD data set. We compare the performance of two ML algorithms, Logistic Regression (LR) and XGBoost, in detecting CCF transactions in real-life data. XGBoost has an inherent ability to handle missing values: when it encounters a missing value at a node, it tries both the left and right branches of the split and learns the direction that yields the highest gain; this learned default direction is then used when the test data are run through the model. The experimental results show the effectiveness of the XGBoost classifier. Performance is reported with widely accepted metrics, accuracy and recall, and the two approaches are also compared using the ROC curve.
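
A minimal sketch of the kind of comparison described above is shown below, assuming the commonly used public credit-card-fraud CSV layout with a "Class" label column; LR needs explicit imputation of missing values, whereas XGBoost accepts NaNs directly and routes them along a learned default split direction.

```python
# Sketch comparing LR and XGBoost on an imbalanced fraud data set. The CSV
# and its "Class" column are assumptions; adjust for the data actually used.
import pandas as pd
import xgboost as xgb
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("creditcard.csv")                    # assumed file name
X, y = df.drop(columns="Class"), df["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3,
                                           random_state=0)

# Logistic regression requires explicit imputation and scaling.
lr = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(),
                   LogisticRegression(max_iter=1000, class_weight="balanced"))
lr.fit(X_tr, y_tr)

# XGBoost learns a default branch for missing values at each split, so NaNs
# can be passed in directly without imputation.
bst = xgb.XGBClassifier(n_estimators=300, eval_metric="auc",
                        scale_pos_weight=(y_tr == 0).sum() / (y_tr == 1).sum())
bst.fit(X_tr, y_tr)

for name, clf in [("LR", lr), ("XGBoost", bst)]:
    print(name, roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```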


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses some advanced techniques for dealing with missing values in a data set measuring air quality, using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing-data mechanisms are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is based on the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait, aggregated to a daily basis. A logarithm transformation was carried out for all pollutant data, in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the other imputation methods and, thus, can be considered appropriate for analyzing air quality data.
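
A minimal sketch in the spirit of the missForest procedure is shown below, using scikit-learn's IterativeImputer with random-forest regressors rather than the original R package; the file and column names are assumptions for illustration.

```python
# Iterative random-forest imputation of pollutant data, missForest-style.
# Column and file names are assumed; the log transform mirrors the paper.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

df = pd.read_csv("kuwait_air_quality_daily.csv")      # assumed file name
pollutants = ["NO2", "CO", "PM10", "SO2", "O3"]
controls = ["temp", "rh", "wind_dir", "wind_speed"]    # climatological covariates

# Log-transform pollutant concentrations to reduce skewness.
df[pollutants] = np.log1p(df[pollutants])

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, n_jobs=-1),
    max_iter=10, random_state=0)
df[pollutants + controls] = imputer.fit_transform(df[pollutants + controls])

# Back-transform to the original concentration scale.
df[pollutants] = np.expm1(df[pollutants])
```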


1997 ◽  
Vol 08 (03) ◽  
pp. 301-315 ◽  
Author(s):  
Marcel J. Nijman ◽  
Hilbert J. Kappen

A Radial Basis Boltzmann Machine (RBBM) is a specialized Boltzmann Machine architecture that combines feed-forward mapping with probability estimation in the input space, and for which very efficient learning rules exist. The hidden representation of the network displays symmetry breaking as a function of the noise in the dynamics. Thus, generalization can be studied as a function of the noise in the neuron dynamics instead of as a function of the number of hidden units. We show that the RBBM can be seen as an elegant alternative to k-nearest neighbor, offering comparable performance without the need to store all the data. We show that the RBBM has good classification performance compared to the multilayer perceptron (MLP). The main advantage of the RBBM is that, simultaneously with the input-output mapping, a model of the input space is obtained which can be used for learning with missing values. We derive learning rules for the case of incomplete data, and show that they perform better on incomplete data than the traditional learning rules applied to a 'repaired' data set.
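
The idea of exploiting a model of the input space to cope with missing values can be illustrated, not with the RBBM itself, but with a simplified stand-in: a diagonal Gaussian mixture fitted to the complete rows, whose component responsibilities, computed from the observed dimensions only, are used to fill in the missing ones. All data and parameter choices below are synthetic assumptions.

```python
# Simplified stand-in (not the RBBM): use a density model of the input space
# to fill in missing values via responsibility-weighted component means.
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(4, 1, (200, 3))])
X[rng.random(X.shape) < 0.1] = np.nan                  # knock out 10% of entries

complete = X[~np.isnan(X).any(axis=1)]
gmm = GaussianMixture(n_components=2, covariance_type="diag").fit(complete)

X_filled = X.copy()
for i in np.argwhere(np.isnan(X).any(axis=1)).ravel():
    obs = ~np.isnan(X[i])
    # Responsibility of each component given only the observed dimensions.
    log_r = np.log(gmm.weights_) + norm.logpdf(
        X[i, obs], gmm.means_[:, obs],
        np.sqrt(gmm.covariances_[:, obs])).sum(axis=1)
    r = np.exp(log_r - log_r.max())
    r /= r.sum()
    # With diagonal covariances, the conditional mean of a missing dimension
    # within a component is simply that component's mean.
    X_filled[i, ~obs] = r @ gmm.means_[:, ~obs]
```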


2021 ◽  
pp. e1-e9
Author(s):  
Elizabeth A. Erdman ◽  
Leonard D. Young ◽  
Dana L. Bernson ◽  
Cici Bauer ◽  
Kenneth Chui ◽  
...  

Objectives. To develop an imputation method to produce estimates for suppressed values within a shared government administrative data set, facilitating accurate data sharing and statistical and spatial analyses. Methods. We developed an imputation approach that incorporated known features of suppressed Massachusetts surveillance data from 2011 to 2017 to predict missing values more precisely. Our methods for 35 de-identified opioid prescription data sets combined modified previous-or-next substitution followed by mean imputation and a count adjustment to estimate suppressed values before sharing. We modeled 4 methods and compared the results to baseline mean imputation. Results. We assessed performance by comparing root mean squared error (RMSE), mean absolute error (MAE), and proportional variance between imputed and suppressed values. Our method outperformed mean imputation; we retained 46% of the suppressed values’ proportional variance with better precision (22% lower RMSE and 26% lower MAE) than simple mean imputation. Conclusions. Our easy-to-implement imputation technique largely overcomes the adverse effects of low-count value suppression, with superior results to simple mean imputation. This novel method is generalizable to researchers sharing protected public health surveillance data. (Am J Public Health. Published online ahead of print September 16, 2021: e1–e9. https://doi.org/10.2105/AJPH.2021.306432 )
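
A rough sketch of the general flavor of such an approach (not the authors' exact algorithm) is given below: suppressed cells are filled by previous-or-next substitution along each series, by a mean fallback where that fails, and finally adjusted so each imputed count stays below the suppression threshold. The threshold and the example series are assumptions.

```python
# Rough sketch only: suppressed low counts (stored as NaN) are filled by
# previous/next substitution, then a mean fallback, then a count adjustment.
import numpy as np
import pandas as pd

SUPPRESSION_THRESHOLD = 11          # assumed: counts of 1-10 are suppressed

def impute_suppressed(counts: pd.Series) -> pd.Series:
    """Impute NaN (suppressed) entries of one time series of counts."""
    filled = counts.copy()
    # Previous-or-next substitution: carry the nearest reported value into gaps.
    filled = filled.ffill().bfill()
    # Mean imputation as a fallback for entirely suppressed series.
    filled = filled.fillna(filled.mean())
    # Count adjustment: imputed cells must remain plausible low counts.
    mask = counts.isna()
    filled[mask] = filled[mask].clip(1, SUPPRESSION_THRESHOLD - 1).round()
    return filled

# Example: quarterly counts for one town, with two suppressed cells.
series = pd.Series([34, np.nan, 15, np.nan, 52, 41])
print(impute_suppressed(series).tolist())
```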


Author(s):  
Soukaena Hassan Hashem

This chapter aims to build a proposed Wire/Wireless Network Intrusion Detection System (WWNIDS) to detect intrusions, taking into account many modern attacks that were not considered previously. The proposed WWNIDS treats intrusion detection using only the intrinsic features rather than all of them. The WWNIDS dataset consists of two parts. The first part is a wired-network dataset constructed from KDD'99, which has 41 features; some modifications, including three suggested additional features, produce the proposed dataset, called Modern KDD, to make intrusion detection more reliable. The second part is a wireless-network dataset built by collecting thousands of sessions (normal and intrusion); this proposed dataset is called the Constructed Wireless Data Set (CWDS). A preprocessing step is applied to both datasets (KDD & CWDS) to eliminate problems that affect intrusion detection, such as noise, missing values, and duplication.
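
A minimal sketch of the preprocessing step described above is shown below for a KDD'99-style CSV; the file names and the choice of median/mode filling and percentile clipping are illustrative assumptions, and the three proposed additional features are not reproduced here.

```python
# Sketch of the preprocessing step: remove duplicates, fill missing values,
# and clip noisy numeric values. File and column layouts are assumed.
import pandas as pd

def preprocess(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Remove duplicated sessions, which bias detection toward frequent records.
    df = df.drop_duplicates()

    # Fill missing values: medians for numeric features, modes for symbolic ones.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Simple noise handling: clip numeric features to reasonable percentiles.
    for col in df.select_dtypes(include="number").columns:
        lo, hi = df[col].quantile([0.001, 0.999])
        df[col] = df[col].clip(lo, hi)
    return df

wire = preprocess("modern_kdd.csv")      # assumed file name
wireless = preprocess("cwds.csv")        # assumed file name
```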


Author(s):  
Fabrizio Angiulli

Data mining techniques can be grouped into four main categories: clustering, classification, dependency detection, and outlier detection. Clustering is the process of partitioning a set of objects into homogeneous groups, or clusters. Classification is the task of assigning objects to one of several predefined categories. Dependency detection searches for pairs of attribute sets which exhibit some degree of correlation in the data set at hand. The outlier detection task can be defined as follows: “Given a set of data points or objects, find the objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data”. These exceptional objects are also referred to as outliers. Most of the early methods for outlier identification were developed in the field of statistics (Hawkins, 1980; Barnett & Lewis, 1994). Hawkins’ definition of outlier clarifies the approach: “An outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”. Indeed, statistical techniques assume that the given data set follows a distribution model. Outliers are those points that satisfy a discordancy test, that is, that are significantly far from their expected position under the hypothesized distribution. Many clustering, classification and dependency detection methods produce outliers as a by-product of their main task. For example, in classification, mislabeled objects are considered outliers and are removed from the training set to improve the accuracy of the resulting classifier, while in clustering, objects that do not strongly belong to any cluster are considered outliers. Nevertheless, searching for outliers with techniques designed for tasks other than outlier detection may not be advantageous. As an example, clusters can be distorted by outliers, and thus the quality of the outliers returned is affected by their presence. Moreover, besides returning a solution of higher quality, outlier detection algorithms can be vastly more efficient than non-ad-hoc algorithms. While in many contexts outliers are considered noise that must be eliminated, as pointed out elsewhere, “one person’s noise could be another person’s signal”, and thus outliers themselves can be of great interest. Outlier mining is used in telecom and credit card fraud detection to spot atypical usage of telecom services or credit cards, in intrusion detection to detect unauthorized accesses, in medical analysis to test abnormal reactions to new medical therapies, in marketing and customer segmentation to identify customers who spend much more or much less than the average customer, in surveillance systems, in data cleaning, and in many other fields.
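
The statistical flavor of a discordancy test can be illustrated with a short sketch: hypothesize a Gaussian model for the data and flag the points whose Mahalanobis distance from the fitted distribution exceeds a chi-square cutoff. The data and the 99.9% threshold below are illustrative assumptions.

```python
# Discordancy-test sketch: fit a multivariate Gaussian and flag points that
# lie significantly far from their expected position under that model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
data = np.vstack([data, [[6.0, -5.0], [7.5, 8.0]]])    # two injected outliers

mean = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

# Squared Mahalanobis distance of each point from the fitted distribution.
diff = data - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Under the Gaussian model, d2 follows a chi-square distribution with as many
# degrees of freedom as dimensions; flag points beyond the 99.9th percentile.
threshold = stats.chi2.ppf(0.999, df=data.shape[1])
print(np.where(d2 > threshold)[0])
```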

