Computational Intelligence for Missing Data Imputation, Estimation, and Management

Published By IGI Global

9781605663364, 9781605663371

Author(s):  
Tshilidzi Marwala

The problem of missing data in databases has recently been dealt with through the use of computational intelligence. The hybrid of auto-associative neural networks and genetic algorithms has proven to be a successful approach to missing data imputation. Two auto-associative neural network techniques are developed for use in conjunction with a genetic algorithm to estimate missing data, and these approaches are compared to a Bayesian auto-associative neural network and genetic algorithm approach. One technique combines three neural networks to form a hybrid auto-associative network, while the other merges principal component analysis and neural networks. The hybrid of the neural network and genetic algorithm approach proves to be the most accurate when estimating one missing value, while the hybrid of principal component analysis and neural networks is more consistent and captures patterns in the data more efficiently.


Author(s):  
Tshilidzi Marwala

Neural networks are used in this chapter for classifying the HIV status of individuals based on socioeconomic and demographic characteristics. The trained network is then used to create an error equation with one of the demographic variables as a missing input and the desired HIV status as one of the variables. The missing variable thus becomes a control variable. This control mechanism is proposed to assess the effect of education level on the HIV risk of individuals and, thereby, to assist in understanding the extent to which the spread of HIV can be controlled through education level. An inverse neural network model and a missing data approximation model based on an autoassociative neural network and genetic algorithm (ANNGA) are used for the control mechanism. The ANNGA is used to obtain the missing input values (education level) for the first model, and an inverse neural network model is used to obtain them for the second model. The two models are then compared and it is found that the proposed inverse neural network model outperforms the ANNGA model. The methodology thus shows that HIV spread can be controlled to some extent by modifying a demographic characteristic, namely education level.


Author(s):  
Tshilidzi Marwala

This chapter is divided into three parts. The first part presents a computational intelligence approach for predicting missing data in the presence of concept drift using an ensemble of multi-layered feed-forward neural networks. An algorithm that detects concept drift by measuring heteroskedasticity is proposed. Six instances prior to the occurrence of missing data are used to approximate the missing values. The algorithm is applied to simulated time series data sets resembling non-stationary data from a sensor. Results show that the prediction of missing data in non-stationary time series data is possible but remains a challenge. In the second part, an algorithm that uses dynamic programming and neural networks to solve the problem of missing data imputation is presented. A model that uses autoassociative neural networks and genetic algorithms is used as a basis; however, the neural networks are not trained using the entire data set. Instead, the data are broken up into granules and a separate model is created for each. The models are tested on a real dataset and the results show that the proposed method is effective in missing data estimation. In the third part of this chapter, a study of the impact of missing data estimation on fault classification in mechanical systems is undertaken. The fault classification task is implemented using the extension network as well as Gaussian mixture models. When the imputed values are used in the classification of faults using the extension network, a fault classification accuracy of 95% is observed for single-missing-entry cases and 92% for two-missing-entry cases, while the complete database gives a classification accuracy of 97%. The Gaussian mixture model, on the other hand, gives 94% for single-missing-entry cases and 92% for two-missing-entry cases, while the complete database gives a classification accuracy of 96%.
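
Heteroskedasticity means that the variance of a series changes over time. As a rough illustration of how a variance change could be flagged, the sketch below compares the variance of adjacent windows. The function name, the window length of six (borrowed from the six prior instances mentioned above) and the ratio threshold are all assumptions made for this example; this is not the chapter's detection algorithm.

```python
from statistics import pvariance

def detect_variance_shift(series, window=6, ratio_threshold=4.0):
    """Return indices where the variance of the next window differs from
    the variance of the previous window by more than ratio_threshold.

    Hypothetical helper: a crude variance-ratio check, assumed for
    illustration only.
    """
    eps = 1e-12  # avoids division by zero for constant windows
    flagged = []
    for i in range(window, len(series) - window + 1):
        before = pvariance(series[i - window:i])
        after = pvariance(series[i:i + window])
        ratio = (after + eps) / (before + eps)
        if ratio > ratio_threshold or ratio < 1.0 / ratio_threshold:
            flagged.append(i)
    return flagged

# Simulated sensor signal: small oscillation, then a much larger one.
calm = [0.0, 0.1] * 6
noisy = [0.0, 1.0] * 6
flags = detect_variance_shift(calm + noisy)
stable_flags = detect_variance_shift(calm + calm)
```

Here `flags` contains the indices around the point where the larger oscillation starts, while `stable_flags` stays empty because the variance never changes.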


Author(s):  
Tshilidzi Marwala

In this chapter, a classifier technique that is based on a missing data estimation framework that uses autoassociative multi-layer perceptron neural networks and genetic algorithms is proposed. The proposed method is tested on a set of demographic properties of individuals obtained from the South African antenatal survey and compared to conventional feed-forward neural networks. The missing data approach based on the proposed autoassociative network model gives an accuracy of 92%, compared to the accuracy of 84% obtained from the conventional feed-forward neural network models. The area under the receiver operating characteristic curve for the proposed autoassociative network model is 0.86, compared to 0.80 for the conventional feed-forward neural network model. The autoassociative network model proposed in this chapter, therefore, outperforms the conventional feed-forward neural network models and is an improved classifier. The reasons for this are: (1) the propagation of errors in the autoassociative network model is more distributed, while for a conventional feed-forward network it is more concentrated; and (2) there is no causality between the demographic properties and the HIV status and, therefore, the HIV status does not change the demographic properties and vice versa. Therefore, it is better to treat the problem as a missing data problem rather than a feed-forward problem.


Author(s):  
Tshilidzi Marwala

This chapter presents various optimization methods for minimizing the missing data error equation, which is constructed from autoassociative neural networks with the missing values as design variables. The four optimization techniques used are: genetic algorithm, particle swarm optimization, hill climbing and simulated annealing. These optimization methods are tested on two datasets, namely, the beer taster dataset and the fault identification dataset, and the results are compared. For these datasets, the results indicate that the genetic algorithm approach produces the highest accuracy when compared to simulated annealing and particle swarm optimization. The results of the four optimization methods are, however, of the same order of magnitude, with hill climbing producing the lowest accuracy.
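
To make the idea of treating a missing value as a design variable concrete, the sketch below applies two of the four optimizers named above, hill climbing and simulated annealing, to a one-dimensional error function. The function `missing_data_error` is a hypothetical stand-in for the error of a trained autoassociative network, and all parameter values are assumptions chosen for the example; nothing here reproduces the chapter's experiments.

```python
import math
import random

def missing_data_error(t):
    """Hypothetical stand-in for the missing data error equation with a
    single missing entry t; its minimum is placed at t = 3 by assumption."""
    return (t - 3.0) ** 2

def hill_climb(error, start, step=0.05, max_iters=10_000):
    """Move to the better neighbour until neither neighbour improves."""
    current = start
    for _ in range(max_iters):
        best_neighbor = min((current - step, current + step), key=error)
        if error(best_neighbor) >= error(current):
            break
        current = best_neighbor
    return current

def simulated_annealing(error, start, rng, temperature=1.0,
                        cooling=0.99, sigma=0.5, iters=2000):
    """Accept worse candidates with a probability that shrinks as the
    temperature cools; return the best accepted state."""
    current = best = start
    for _ in range(iters):
        candidate = current + rng.gauss(0.0, sigma)
        delta = error(candidate) - error(current)
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if error(current) < error(best):
                best = current
        temperature *= cooling
    return best

hc_estimate = hill_climb(missing_data_error, start=0.0)
sa_estimate = simulated_annealing(missing_data_error, 0.0, random.Random(0))
```

On this smooth, single-minimum function both searches end near the minimizer, so the example says nothing about their relative accuracy on real data.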


Author(s):  
Tshilidzi Marwala

Missing data creates various problems in analyzing and processing data in databases. In this chapter, a method aimed at approximating missing data in a database that uses a combination of genetic algorithms and neural networks is introduced. The presented method uses genetic algorithms to minimize an error function derived from an auto-associative neural network. The Multi-Layer Perceptron (MLP) and Radial Basis Function (RBF) networks are employed to form an auto-associative network. An investigation is undertaken into using the method to predict missing data accurately as the number of missing cases within a single record increases. It is observed that there is no significant reduction in the accuracy of the results as the number of missing cases in a single record increases. It is also found that results obtained from using the MLP are better than from the RBF for the data used.
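
The approach described above can be sketched in a few lines: an auto-associative model reproduces its input, so a candidate value for the missing entry is scored by how badly the completed record is reconstructed, and a genetic algorithm searches for the candidate with the lowest score. In this sketch the trained MLP or RBF network is replaced by a fixed linear projection (`reconstruct`), which is an assumption made to keep the example self-contained, and the genetic algorithm is a simplified real-coded variant with elitism. All names and parameter values are hypothetical.

```python
import random

DIRECTION = (1.0, 2.0, 3.0)  # hypothetical pattern the stand-in model has "learned"

def reconstruct(x):
    """Stand-in for a trained auto-associative network: project x onto DIRECTION."""
    scale = sum(a * d for a, d in zip(x, DIRECTION)) / sum(d * d for d in DIRECTION)
    return [scale * d for d in DIRECTION]

def imputation_error(known, guess):
    """Squared reconstruction error of the record completed with guess."""
    x = list(known) + [guess]
    return sum((a - b) ** 2 for a, b in zip(x, reconstruct(x)))

def genetic_search(error, low, high, rng, pop_size=30, elite=10,
                   generations=60, sigma=0.1):
    """Simplified genetic algorithm: keep the fittest candidates, create
    children by blending two parents, and mutate with Gaussian noise."""
    population = [rng.uniform(low, high) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=error)
        parents = population[:elite]
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            w = rng.random()
            child = w * a + (1 - w) * b + rng.gauss(0.0, sigma)
            children.append(min(high, max(low, child)))
        population = parents + children
    return min(population, key=error)

rng = random.Random(0)
known_values = [1.0, 2.0]  # the third entry of the record is missing
estimate = genetic_search(lambda g: imputation_error(known_values, g),
                          0.0, 10.0, rng)
```

Because the stand-in model reconstructs records proportional to `(1, 2, 3)` exactly, the search should settle close to 3 for the missing third entry.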


Author(s):  
Tshilidzi Marwala

In this chapter, traditional missing data imputation issues such as missing data patterns and mechanisms are described. Attention is paid to the best models to deal with particular missing data mechanisms. A review of traditional missing data imputation methods, namely case deletion and prediction rules, is conducted. For case deletion, list-wise and pair-wise deletions are reviewed. For prediction rules, imputation techniques such as mean substitution, hot-deck, regression and decision trees are reviewed. Two missing data examples are studied, namely the Sudoku puzzle and a mechanical system. The major conclusions drawn from these examples are that there is a need for an accurate model that describes the inter-relationships and rules that define the data, and that a good optimization method is required for a successful missing data estimation procedure.
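
Two of the traditional methods reviewed above, list-wise deletion and mean substitution, are simple enough to show directly. The sketch below uses `None` to mark a missing entry; the function names and the sample records are made up for illustration.

```python
from statistics import mean

def listwise_delete(rows):
    """Drop every record that has at least one missing (None) entry."""
    return [row for row in rows if all(v is not None for v in row)]

def mean_substitute(rows):
    """Replace each missing entry with the mean of the observed values
    in its column. Raises StatisticsError if a column is entirely missing."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

records = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [3.0, 20.0],
]
complete_only = listwise_delete(records)  # keeps the first and last record
filled = mean_substitute(records)         # fills gaps with column means
```

List-wise deletion halves this small dataset, while mean substitution keeps every record but replaces the gaps with a value that ignores relationships between the columns.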


Author(s):  
Tshilidzi Marwala

Two sets of hybrid techniques have recently emerged for the imputation of missing data. These are, first, the combination of the Gaussian Mixture Model and the Expectation Maximization algorithm (the GMM-EM) and, second, the combination of Auto-Associative Neural Networks with Evolutionary Optimization (the AANN-EO). In this chapter, the evolutionary optimization method implemented is particle swarm optimization (giving the AANN-PSO). Both the GMM-EM and AANN-EO techniques, and their merits, have been discussed individually and at length in the available literature. This chapter provides a comparison between these techniques, using datasets from an industrial power plant, an industrial winding process and an HIV sero-prevalence survey. The results show that the GMM-EM method is suitable and performs better in cases where there is little or no interdependency between the input variables, whereas the AANN-PSO combination is suitable when there are inherent nonlinear relationships between some of the given variables.
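
For orientation, the standard way a fitted Gaussian mixture fills in missing entries can be written down as follows. The notation is introduced here and is not taken from the chapter: $x_o$ and $x_m$ are the observed and missing parts of a record, and $\pi_k$, $\mu_k$, $\Sigma_k$ are the weight, mean and covariance of component $k$, partitioned to match. Whether the chapter's GMM-EM implementation uses exactly this estimator is an assumption.

```latex
% Conditional mean of the missing part under a single Gaussian
\mathbb{E}[x_m \mid x_o] = \mu_m + \Sigma_{mo}\,\Sigma_{oo}^{-1}\,(x_o - \mu_o)

% Mixture of K Gaussians: responsibility-weighted conditional means
\hat{x}_m = \sum_{k=1}^{K} r_k(x_o)\,
  \bigl[\mu_{k,m} + \Sigma_{k,mo}\,\Sigma_{k,oo}^{-1}\,(x_o - \mu_{k,o})\bigr],
\qquad
r_k(x_o) = \frac{\pi_k\,\mathcal{N}(x_o;\,\mu_{k,o},\,\Sigma_{k,oo})}
                {\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_o;\,\mu_{j,o},\,\Sigma_{j,oo})}
```

When the cross-covariances $\Sigma_{k,mo}$ are zero, the estimate reduces to a weighted average of the component means $\mu_{k,m}$, so the observed values influence it only through the responsibilities.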


Author(s):  
Tshilidzi Marwala

This chapter develops three different data imputation models and compares their merits using accuracy measures. The three methods are auto-associative neural networks, principal component analysis and support vector regression, each combined with cultural genetic algorithms to impute missing variables. The use of principal component analysis improves the overall performance of the auto-associative network, while the use of support vector regression shows promising potential for future investigation. Imputation accuracies of up to 97.4% are achieved for some of the variables.


Author(s):  
Tshilidzi Marwala

A number of techniques for handling missing data have been presented and implemented. Most of these proposed techniques are unnecessarily complex and, therefore, difficult to use. This chapter investigates a hot-deck data imputation method, based on rough set computations. In this chapter, characteristic relations are introduced that describe incompletely specified decision tables and then these are used for missing data estimation. It has been shown that the basic rough set idea of lower and upper approximations for incompletely specified decision tables may be defined in a variety of different ways. Empirical results obtained using real data are given and they provide a valuable insight into the problem of missing data. Missing data are predicted with an accuracy of up to 99%.
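
A hot-deck method fills a gap by borrowing the value from a similar, complete record. The sketch below is a deliberately simplified illustration in that spirit: a record is matched against complete donors that agree on all of its known attributes, and the gap is filled only when every matching donor holds the same value. This loosely mirrors the distinction between certain and merely possible values, but the function names, the matching rule and the sample table are assumptions and do not reproduce the chapter's characteristic relations.

```python
def hot_deck_candidates(record, donors, missing_index):
    """Collect the distinct values that complete donors with identical
    known attributes hold at missing_index (hypothetical helper)."""
    known = [j for j, v in enumerate(record)
             if v is not None and j != missing_index]
    values = []
    for donor in donors:
        if any(v is None for v in donor):
            continue  # only complete records may act as donors
        if all(donor[j] == record[j] for j in known):
            if donor[missing_index] not in values:
                values.append(donor[missing_index])
    return values

def impute(record, donors, missing_index):
    """Fill the gap only when all matching donors agree; otherwise
    return None to signal an ambiguous or unmatched case."""
    candidates = hot_deck_candidates(record, donors, missing_index)
    if len(candidates) != 1:
        return None
    filled = list(record)
    filled[missing_index] = candidates[0]
    return filled

table = [
    ("high", "yes", "pass"),
    ("high", "no", "fail"),
    ("low", "yes", "fail"),
    ("high", "yes", "pass"),
]
certain = impute(("high", "yes", None), table, 2)
ambiguous = hot_deck_candidates(("high", None, None), table, 2)
```

The first record matches two donors that agree, so its gap is filled, while the second matches donors with conflicting values and is left for a less strict rule to resolve.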

