Existing Problems of the KDD Cup 99 Dataset and Data Preprocessing
The KDD Cup 99 dataset is not only the most widely used dataset in intrusion detection but also the de facto benchmark for evaluating the performance of intrusion detection systems. Nevertheless, the dataset has many issues that cannot be ignored. To build good data mining models for intrusion detection and to identify the features that characterize each type of network intrusion attack, researchers need a thorough understanding of this dataset. In this paper, we first present an in-depth analysis of the problems that exist in the dataset and propose corresponding solutions. Second, we carry out extensive data preprocessing on the 10% subset of the KDD Cup 99 training set, which yields better results in the subsequent processing. Finally, by comparing 10 common data mining algorithms in our experiments, we conclude that data preprocessing plays a vital role in the performance of data mining algorithms.