Missing value imputation for gene expression data: computational techniques to recover missing data from available information

Abstract: High-dimensional data such as gene expression and RNA-sequence data often contain missing values, and subsequent analyses based on the incomplete data can suffer strongly as a result. Several approaches to imputing missing values in gene expression data have been developed, but the task is difficult because of the high dimensionality (number of genes) of the data. Here, an imputation procedure is proposed that uses weighted nearest neighbors. Instead of defining nearest neighbors by a distance that includes all genes, the distance is computed only over genes that are apt to contribute to the accuracy of the imputed values. The method aims to avoid the curse of dimensionality, which typically occurs when local methods such as nearest neighbors are applied in high-dimensional settings. The proposed weighted nearest neighbors algorithm is compared with existing missing value imputation techniques such as mean imputation, KNNimpute, and the recently proposed imputation by random forests. We use RNA-sequence and microarray data from studies on human cancer to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values in high-dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.
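
The idea in the abstract above (measure distance only over genes likely to help, then take a weighted average over neighbors) can be sketched roughly as follows. This is an illustrative sketch, not the authors' algorithm: the function name `weighted_nn_impute`, the use of absolute correlation to pick genes, the Gaussian kernel weights, and the `n_genes` and `bandwidth` parameters are all assumptions made for the example.

```python
import numpy as np

def weighted_nn_impute(X, n_genes=5, bandwidth=1.0):
    """Fill NaNs in X (samples x genes) with a kernel-weighted donor average.

    Hypothetical sketch: gene selection by absolute correlation and
    Gaussian weights are assumptions, not the published procedure.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Mean-filled copy, used to estimate gene-gene correlations and to
    # stand in for gaps in donor samples when computing distances.
    col_means = np.nanmean(X, axis=0)
    X_mean = np.where(missing, col_means, X)
    corr = np.abs(np.corrcoef(X_mean, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    corr = np.nan_to_num(corr)  # constant genes yield NaN correlations

    filled = X.copy()
    for i, j in zip(*np.nonzero(missing)):
        # Genes most correlated with gene j that are observed in sample i.
        ranked = np.argsort(corr[j])[::-1]
        genes = [g for g in ranked if not missing[i, g]][:n_genes]
        # Donor samples: those in which gene j is observed.
        donors = np.nonzero(~missing[:, j])[0]
        if len(genes) == 0 or len(donors) == 0:
            filled[i, j] = col_means[j]  # fall back to mean imputation
            continue

        # Distance is computed over the selected genes only.
        diff = X_mean[donors][:, genes] - X[i, genes]
        dist = np.sqrt(np.mean(diff ** 2, axis=1))
        weights = np.exp(-0.5 * (dist / bandwidth) ** 2)
        total = weights.sum()
        if total > 0:
            filled[i, j] = np.sum(weights * X[donors, j]) / total
        else:
            filled[i, j] = col_means[j]
    return filled
```

Restricting the distance to a few selected genes is what keeps the neighbor search meaningful when genes far outnumber samples; the bandwidth controls how quickly distant donors lose influence.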

An efficient technique for missing value imputation in microarray gene expression data

Proceedings of IEEE International Conference on Computer Communication and Systems ICCCS14 ◽ 10.1109/icccs.2014.7068171 ◽ 2014

Author(s): P. Valarmathie ◽ K. Dinakaran

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Microarray Gene Expression Data ◽ Efficient Technique ◽ Expression Data ◽ Microarray Gene Expression ◽ Missing Value ◽ Missing Value Imputation ◽ Microarray Gene

Comparison of estimation methods for missing value imputation of gene expression data

2016 Medical Technologies National Congress (TIPTEKNO) ◽ 10.1109/tiptekno.2016.7863090 ◽ 2016

Author(s): Ali Sarikas ◽ Niyazi Odabasioglu ◽ Gokmen Altay

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Estimation Methods ◽ Expression Data ◽ Missing Value ◽ Missing Value Imputation

A novel biclustering based missing value prediction method for microarray gene expression data

2015 International Conference on Man and Machine Interfacing (MAMI) ◽ 10.1109/mami.2015.7456603 ◽ 2015 ◽ Cited By ~ 3

Author(s): Samiran Chattopadhyay ◽ Chandra Das ◽ Shilpi Bose

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Prediction Method ◽ Microarray Gene Expression Data ◽ Expression Data ◽ Microarray Gene Expression ◽ Missing Value ◽ Value Prediction ◽ Microarray Gene ◽ Missing Value Prediction

Missing data imputation using Evolutionary k-Nearest neighbor algorithm for gene expression data

2016 Sixteenth International Conference on Advances in ICT for Emerging Regions (ICTer) ◽ 10.1109/icter.2016.7829911 ◽ 2016 ◽ Cited By ~ 4

Author(s): Hiroshi de Silva ◽ A. Shehan Perera

Keyword(s): Gene Expression ◽ Missing Data ◽ Gene Expression Data ◽ Nearest Neighbor ◽ Expression Data ◽ Data Imputation ◽ K Nearest Neighbor ◽ Nearest Neighbor Algorithm ◽ Missing Data Imputation ◽ K Nearest Neighbor Algorithm

An Improved Fuzzy Based Missing Value Estimation in DNA Microarray Validated by Gene Ranking

Advances in Fuzzy Systems ◽ 10.1155/2016/6134736 ◽ 2016 ◽ Vol 2016 ◽ pp. 1-19 ◽ Cited By ~ 2

Author(s): Sujay Saha ◽ Anupam Ghosh ◽ Dibyendu Bikash Seal ◽ Kashi Nath Dey

Keyword(s): Gene Expression ◽ Gene Expression Data ◽ Missing Values ◽ Breast Cancer Dataset ◽ Expression Data ◽ Cancer Dataset ◽ Gene Ranking ◽ Missing Value ◽ Value Estimation ◽ Missing Value Estimation

Most gene expression data analysis algorithms require a complete gene expression matrix without any missing values. Hence, it is necessary to devise methods that impute missing values accurately. A number of imputation algorithms exist for estimating those missing values. This work starts with a microarray dataset containing multiple missing values. We first apply a modified version of the existing fuzzy-theory-based method LRFDVImpute to impute multiple missing values in time series gene expression data, and then validate the imputation results with a genetic algorithm (GA) based gene ranking methodology along with regular statistical validation techniques such as RMSE. To the best of our knowledge, gene ranking has not previously been used to validate the results of missing value estimation. The proposed method was first tested on the widely used Spellman dataset; the results show that error margins are drastically reduced compared with some previous works, which indirectly validates the statistical significance of the proposed method. It was then applied to four other two-class benchmark datasets, namely the Colorectal Cancer tumours dataset (GDS4382), the Breast Cancer dataset (GSE349-350), the Prostate Cancer dataset, and DLBCL-FL (Leukaemia), for both missing value estimation and gene ranking. The results show that the proposed method can reach 100% classification accuracy with very few dominant genes, which indirectly validates the biological significance of the proposed method.
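
RMSE-based validation of an imputation method, as mentioned in the abstract above, is commonly done by hiding known entries of a complete matrix, imputing them, and comparing against the hidden truth. The sketch below illustrates that general protocol only; the names `rmse_on_masked` and `column_mean_impute`, the random masking scheme, and the unnormalized RMSE are assumptions, and the paper may use a different variant.

```python
import numpy as np

def rmse_on_masked(X_complete, impute_fn, missing_rate=0.1, seed=0):
    """Hide a random fraction of entries, impute them, and return the RMSE.

    `impute_fn` is any function mapping a matrix with NaNs to a filled
    matrix of the same shape (hypothetical interface for this sketch).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X_complete, dtype=float)

    # Choose entries to hide uniformly at random.
    mask = rng.random(X.shape) < missing_rate
    if not mask.any():
        raise ValueError("no entries were masked; increase missing_rate")

    X_missing = X.copy()
    X_missing[mask] = np.nan
    X_imputed = impute_fn(X_missing)

    # Error is measured only on the entries that were hidden.
    err = X_imputed[mask] - X[mask]
    return float(np.sqrt(np.mean(err ** 2)))

def column_mean_impute(X):
    """Baseline: replace each NaN with the mean of its column."""
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X)
```

Running the same masked matrix (same `seed`) through several imputers gives directly comparable error values, with mean imputation serving as a simple baseline.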
