Missing value imputation for gene expression data by tailored nearest neighbors

Author(s):  
Shahla Faisal ◽  
Gerhard Tutz

High-dimensional data such as gene expression and RNA-sequence data often contain missing values. Subsequent analyses based on these incomplete data can suffer strongly from the presence of missing values. Several approaches to the imputation of missing values in gene expression data have been developed, but the task is difficult because of the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of using nearest neighbors defined by a distance that includes all genes, the distance is computed only over genes that are apt to contribute to the accuracy of the imputed values. The method aims to avoid the curse of dimensionality, which typically occurs when local methods such as nearest neighbors are applied in high-dimensional settings. The proposed weighted nearest neighbors algorithm is compared to existing missing value imputation techniques: mean imputation, KNNimpute, and the recently proposed imputation by random forests. We use RNA-sequence and microarray data from studies on human cancer to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values in high-dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.
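The idea of restricting the distance to informative genes can be illustrated with a minimal sketch. The abstract does not give the authors' distance or weight functions, so everything here is an assumption: the function names `pearson` and `impute_entry`, the selection of genes by absolute Pearson correlation, the Gaussian kernel, and the parameters `n_genes` and `bandwidth`. Rows are samples, columns are genes, and `None` marks a missing value.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def impute_entry(data, i, j, n_genes=2, bandwidth=1.0):
    """Estimate data[i][j] from samples that resemble sample i on selected genes."""
    n_cols = len(data[0])
    # Rank the other genes by |correlation| with gene j (assumed selection rule),
    # using only the samples where both genes are observed.
    scores = []
    for g in range(n_cols):
        if g == j or data[i][g] is None:
            continue
        pairs = [(r[g], r[j]) for r in data
                 if r[g] is not None and r[j] is not None]
        corr = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        scores.append((abs(corr), g))
    selected = [g for _, g in sorted(scores, reverse=True)[:n_genes]]

    num = den = 0.0
    for k, row in enumerate(data):
        if k == i or row[j] is None:
            continue
        shared = [g for g in selected if row[g] is not None]
        if not shared:
            continue
        # Distance over the selected genes only, not over all genes.
        dist = math.sqrt(
            sum((data[i][g] - row[g]) ** 2 for g in shared) / len(shared))
        weight = math.exp(-(dist / bandwidth) ** 2)  # assumed Gaussian kernel
        num += weight * row[j]
        den += weight
    if den == 0.0:
        # No usable neighbor: fall back to the mean of the observed values.
        observed = [r[j] for r in data if r[j] is not None]
        return sum(observed) / len(observed)
    return num / den

# Gene 1 tracks gene 0 closely; gene 2 is only weakly related to gene 1.
samples = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 9.0],
    [3.0, None, 7.0],
]
estimate = impute_entry(samples, 3, 1, n_genes=1, bandwidth=0.1)
```

With `n_genes=1` only gene 0 enters the distance, so the estimate is driven by the sample whose gene 0 value matches that of the incomplete sample, and gene 2 cannot distort the neighborhood.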

2016 ◽  
Vol 2016 ◽  
pp. 1-19 ◽  
Author(s):  
Sujay Saha ◽  
Anupam Ghosh ◽  
Dibyendu Bikash Seal ◽  
Kashi Nath Dey

Most gene expression data analysis algorithms require the entire gene expression matrix without any missing values. Hence, it is necessary to devise methods that impute missing values accurately. A number of imputation algorithms exist to estimate those missing values. This work starts with a microarray dataset containing multiple missing values. We first apply a modified version of the existing fuzzy-theory-based method LRFDVImpute to impute multiple missing values in time series gene expression data, and then validate the imputation results with a genetic algorithm (GA) based gene ranking methodology along with regular statistical validation techniques such as RMSE. To the best of our knowledge, gene ranking has not previously been used to validate the results of missing value estimation. The proposed method was first tested on the widely used Spellman dataset, and the results show that error margins are drastically reduced compared to some previous works, which indirectly validates the statistical significance of the proposed method. It was then applied to four other two-class benchmark datasets, the Colorectal Cancer tumours dataset (GDS4382), the Breast Cancer dataset (GSE349-350), the Prostate Cancer dataset, and DLBCL-FL (Leukaemia), for both missing value estimation and gene ranking. The results show that the proposed method can reach 100% classification accuracy with very few dominant genes, which indirectly validates the biological significance of the proposed method.
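RMSE-based validation of an imputation method is commonly done by hiding known entries, imputing them, and comparing the estimates with the hidden truth. The sketch below assumes that setup; the function name `rmse_on_masked` and the representation of masked positions as `(row, column)` pairs are illustrative choices, not taken from the paper, which may use a different or normalized variant.

```python
import math

def rmse_on_masked(truth, imputed, masked):
    """RMSE between true and imputed values over the masked (row, col) positions."""
    squared_errors = [(truth[r][c] - imputed[r][c]) ** 2 for r, c in masked]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Two entries were hidden and then re-estimated by some imputation method.
truth = [[1.0, 2.0], [3.0, 4.0]]
imputed = [[1.0, 2.5], [3.0, 3.5]]
masked = [(0, 1), (1, 1)]
error = rmse_on_masked(truth, imputed, masked)  # each estimate is off by 0.5
```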


Author(s):  
Umu Sa'adah ◽  
Masithoh Yessi Rochayani ◽  
Ani Budi Astuti

Classifying high-dimensional data is a challenging task in data mining. Gene expression data is a type of high-dimensional data with thousands of features. This study proposes a method to extract knowledge from high-dimensional gene expression data by selecting features and then classifying. The lasso was used for feature selection, and the classification and regression tree (CART) algorithm was used to construct the decision tree model. To examine the stability of the lasso decision tree, we performed bootstrap aggregating (bagging) with 50 replications. The gene expression data used was an ovarian tumor dataset with 1,545 observations, 10,935 gene features, and a binary class. The findings show that the lasso decision tree could produce an interpretable model that is theoretically correct and had an accuracy of 89.32%. The model obtained from the majority vote gave an accuracy of 90.29%, an increase of about 1 percentage point (0.97) over the single lasso decision tree model. This small increase in accuracy indicates that the lasso decision tree classifier is stable.
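The bagging step with a majority vote can be sketched as follows. This is not the study's pipeline: a one-split decision stump stands in for the lasso-selected CART tree so that the example stays short and dependency-free, and the names `fit_stump`, `predict_stump`, and `bagged_predict`, as well as the toy data, are hypothetical. Only the 50 bootstrap replications and the majority vote follow the abstract.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Return (feature, threshold, left_label, right_label) with fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            l_lab = Counter(left).most_common(1)[0][0]
            # An empty right side reuses the left label.
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, thr, l_lab, r_lab)
    return best[1:]

def predict_stump(stump, row):
    """Classify one row with a fitted stump."""
    f, thr, l_lab, r_lab = stump
    return l_lab if row[f] <= thr else r_lab

def bagged_predict(X, y, row, n_replicates=50, seed=0):
    """Majority vote over stumps fitted on bootstrap resamples of (X, y)."""
    rng = random.Random(seed)
    n = len(X)
    votes = []
    for _ in range(n_replicates):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        votes.append(predict_stump(stump, row))
    return Counter(votes).most_common(1)[0][0]

# Toy data: one feature that separates the two classes.
X = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
high = bagged_predict(X, y, [0.95])
low = bagged_predict(X, y, [0.05])
```

If the individual models are stable, the vote changes few predictions, which is the reasoning behind reading a small accuracy gain from bagging as evidence of stability.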


2006 ◽  
Vol 04 (05) ◽  
pp. 935-957 ◽  
Author(s):  
ZHIPENG CAI ◽  
MAYSAM HEYDARI ◽  
GUOHUI LIN

Microarray gene expression data often contain multiple missing values for various reasons. However, most gene expression data analysis algorithms require complete expression data. Therefore, accurate estimation of the missing values is critical to further data analysis. In this paper, an Iterated Local Least Squares Imputation (ILLSimpute) method is proposed for estimating missing values. ILLSimpute has two distinctive features. First, it does not fix a common number of coherent genes for every target gene, but defines the coherent genes as those within a distance threshold of the target gene. Second, the values estimated in one iteration are used for missing value estimation in the next iteration, and the method terminates after a certain number of iterations or when the imputed values converge. Experimental results on six real microarray datasets showed that ILLSimpute performed at least as well as, and most of the time much better than, the five most recent imputation methods.
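The two features described above, threshold-based selection of coherent genes and reuse of estimates across iterations, can be sketched as follows. This is a simplified illustration under stated assumptions, not the published ILLSimpute: the names `solve` and `ills_sketch`, the row-mean starting fill, the Euclidean distance, the small `ridge` term that keeps the normal equations solvable, and the stopping tolerance `tol` are all choices made for this example. Rows are genes, columns are samples, and `None` marks a missing value.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for c in range(n - 1, -1, -1):
        tail = sum(m[c][k] * x[k] for k in range(c + 1, n))
        x[c] = (m[c][n] - tail) / m[c][c]
    return x

def ills_sketch(data, threshold, max_iter=10, tol=1e-6, ridge=1e-8):
    """Iteratively impute missing entries from genes within a distance threshold."""
    n_genes, n_samples = len(data), len(data[0])
    missing = [[v is None for v in row] for row in data]
    # Starting point (assumed): fill each gap with the gene's observed mean.
    filled = []
    for row in data:
        obs_vals = [v for v in row if v is not None]
        mean = sum(obs_vals) / len(obs_vals)
        filled.append([mean if v is None else v for v in row])
    for _ in range(max_iter):
        change = 0.0
        new = [row[:] for row in filled]
        for t in range(n_genes):
            miss = [s for s in range(n_samples) if missing[t][s]]
            if not miss:
                continue
            obs = [s for s in range(n_samples) if not missing[t][s]]
            # Coherent genes: every gene within the threshold, however many.
            coherent = []
            for g in range(n_genes):
                if g == t:
                    continue
                d = math.sqrt(sum((filled[t][s] - filled[g][s]) ** 2 for s in obs))
                if d <= threshold:
                    coherent.append(g)
            if not coherent:
                continue
            # Least squares via normal equations: (X^T X + ridge I) w = X^T y.
            k = len(coherent)
            xtx = [[sum(filled[coherent[a]][s] * filled[coherent[b]][s] for s in obs)
                    + (ridge if a == b else 0.0) for b in range(k)] for a in range(k)]
            xty = [sum(filled[coherent[a]][s] * filled[t][s] for s in obs)
                   for a in range(k)]
            w = solve(xtx, xty)
            for s in miss:
                est = sum(w[a] * filled[coherent[a]][s] for a in range(k))
                change = max(change, abs(est - filled[t][s]))
                new[t][s] = est
        filled = new  # estimates from this pass feed the next pass
        if change < tol:
            break
    return filled

# Gene 1 is twice gene 0; gene 2 lies beyond the threshold and is ignored.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, None],
    [10.0, 0.0, 10.0, 0.0],
]
completed = ills_sketch(genes, threshold=5.0)
```

Because the coherent set is defined by distance rather than by a fixed count, a target gene with many close neighbors uses all of them, while an isolated gene keeps its starting estimate.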

