Missing value imputation for gene expression data by tailored nearest neighbors

Abstract: High-dimensional data such as gene expression and RNA-sequence data often contain missing values. Subsequent analyses based on such incomplete data can suffer strongly from these missing values. Several approaches to imputing missing values in gene expression data have been developed, but the task is difficult because of the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of defining nearest neighbors by a distance that includes all genes, the distance is computed only over genes that are apt to contribute to the accuracy of the imputed values. The method aims to avoid the curse of dimensionality, which typically occurs when local methods such as nearest neighbors are applied in high-dimensional settings. The proposed weighted nearest neighbors algorithm is compared with existing missing value imputation techniques such as mean imputation, KNNimpute, and the recently proposed imputation by random forests. RNA-sequence and microarray data from studies on human cancer are used to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values in high-dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.
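The idea of restricting the distance to informative genes can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not say how genes are selected or how neighbors are weighted, so the correlation-based gene ranking, the Gaussian kernel, and the names `impute_cell`, `n_genes`, and `bandwidth` are all assumptions made for illustration. Rows are samples, columns are genes, and `None` marks a missing value.

```python
import math


def column(data, g):
    """Return gene g (one column) across all samples."""
    return [row[g] for row in data]


def pearson(x, y):
    """Pearson correlation over positions where both values are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 3:
        return 0.0
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def impute_cell(data, i, j, n_genes=2, bandwidth=1.0):
    """Estimate data[i][j] as a kernel-weighted mean over the other samples.

    The sample-to-sample distance uses only the n_genes genes most
    correlated with gene j (an assumed selection rule), not all genes.
    """
    target = column(data, j)
    others = [g for g in range(len(data[0])) if g != j]
    ranked = sorted(others,
                    key=lambda g: abs(pearson(target, column(data, g))),
                    reverse=True)
    selected = ranked[:n_genes]

    num = den = 0.0
    for k, row in enumerate(data):
        if k == i or row[j] is None:
            continue
        shared = [g for g in selected
                  if row[g] is not None and data[i][g] is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((row[g] - data[i][g]) ** 2 for g in shared)
                         / len(shared))
        weight = math.exp(-(dist / bandwidth) ** 2)  # assumed Gaussian kernel
        num += weight * row[j]
        den += weight
    if den > 0:
        return num / den
    # Fallback (assumption): mean of the observed values of gene j.
    observed = [v for v in target if v is not None]
    return sum(observed) / len(observed) if observed else None
```

On a toy matrix where one gene tracks the target gene and another is unrelated, setting `n_genes=1` lets the neighbor weights depend only on the related gene, whereas including every gene lets the unrelated one pull the estimate toward a less similar sample.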

Missing value imputation for microarray gene expression data using histone acetylation information

BMC Bioinformatics ◽

10.1186/1471-2105-9-252 ◽

2008 ◽

Vol 9 (1) ◽

Cited By ~ 24

Author(s):

Qian Xiang ◽

Xianhua Dai ◽

Yangyang Deng ◽

Caisheng He ◽

Jiang Wang ◽

...

Keyword(s):

Gene Expression ◽

Histone Acetylation ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Missing Value ◽

Missing Value Imputation ◽

Microarray Gene

A Review on Missing Value Imputation Algorithms for Microarray Gene Expression Data

Current Bioinformatics ◽

10.2174/1574893608999140109120957 ◽

2014 ◽

Vol 9 (1) ◽

pp. 18-22 ◽

Cited By ~ 31

Author(s):

Kohbalan Moorthy ◽

Mohd Mohamad ◽

Safaai Deris

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Missing Value ◽

Missing Value Imputation ◽

Microarray Gene

Use of biclustering for missing value imputation in gene expression data

Artificial Intelligence Research ◽

10.5430/air.v2n2p96 ◽

2013 ◽

Vol 2 (2) ◽

Cited By ~ 4

Author(s):

K.O. Cheng ◽

N.F. Law ◽

W.C. Siu

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Data ◽

Missing Value ◽

Missing Value Imputation

An Improved Fuzzy Based Missing Value Estimation in DNA Microarray Validated by Gene Ranking

Advances in Fuzzy Systems ◽

10.1155/2016/6134736 ◽

2016 ◽

Vol 2016 ◽

pp. 1-19 ◽

Cited By ~ 2

Author(s):

Sujay Saha ◽

Anupam Ghosh ◽

Dibyendu Bikash Seal ◽

Kashi Nath Dey

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Missing Values ◽

Breast Cancer Dataset ◽

Expression Data ◽

Cancer Dataset ◽

Gene Ranking ◽

Missing Value ◽

Value Estimation ◽

Missing Value Estimation

Most gene expression data analysis algorithms require a complete gene expression matrix without any missing values. Hence, it is necessary to devise methods that impute missing values accurately. A number of imputation algorithms exist to estimate those missing values. This work starts with a microarray dataset containing multiple missing values. We first apply a modified version of the existing fuzzy-theory-based method LRFDVImpute to impute multiple missing values in time series gene expression data, and then validate the imputation result with a genetic algorithm (GA) based gene ranking methodology along with regular statistical validation techniques such as RMSE. To the best of our knowledge, gene ranking has not previously been used to validate the result of missing value estimation. The proposed method was first tested on the widely used Spellman dataset, and the results show that error margins are drastically reduced compared with some previous works, which indirectly validates the statistical significance of the proposed method. It was then applied to four other 2-class benchmark datasets, namely the Colorectal Cancer tumours dataset (GDS4382), the Breast Cancer dataset (GSE349-350), the Prostate Cancer dataset, and DLBCL-FL (Leukaemia), for both missing value estimation and gene ranking. The results show that the proposed method can reach 100% classification accuracy with very few dominant genes, which indirectly validates the biological significance of the proposed method.
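RMSE-based validation of an imputation method is commonly done by hiding known entries, imputing them, and comparing the estimates with the hidden truth. The sketch below shows that evaluation loop only. The imputer is a plain row-mean placeholder, not LRFDVImpute, and the names `masked_rmse` and `row_mean_impute` are hypothetical. Rows are genes and `None` marks a missing value.

```python
import math


def rmse(true_values, estimates):
    """Root mean squared error between two equally long sequences."""
    if not true_values or len(true_values) != len(estimates):
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((t - e) ** 2 for t, e in zip(true_values, estimates))
    return math.sqrt(squared / len(true_values))


def row_mean_impute(matrix):
    """Placeholder imputer: fill each gap with the mean of its row (gene)."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        fill = sum(observed) / len(observed) if observed else 0.0
        filled.append([fill if v is None else v for v in row])
    return filled


def masked_rmse(complete, hidden_cells, imputer):
    """Hide the given (row, col) cells, impute, and score with RMSE."""
    masked = [list(row) for row in complete]
    for r, c in hidden_cells:
        masked[r][c] = None
    imputed = imputer(masked)
    truth = [complete[r][c] for r, c in hidden_cells]
    estimates = [imputed[r][c] for r, c in hidden_cells]
    return rmse(truth, estimates)
```

Any imputer that takes a matrix with `None` gaps and returns a filled matrix can be passed as `imputer`, which makes it straightforward to compare methods on the same hidden cells.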

Knowledge discovery from gene expression dataset using bagging lasso decision tree

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v21.i2.pp1151-1159 ◽

2021 ◽

Vol 21 (2) ◽

pp. 1151

Author(s):

Umu Sa'adah ◽

Masithoh Yessi Rochayani ◽

Ani Budi Astuti

Keyword(s):

Gene Expression ◽

Decision Tree ◽

Gene Expression Data ◽

High Dimensional Data ◽

Decision Tree Model ◽

High Dimensional ◽

Expression Data ◽

Tree Model ◽

Tree Classifier ◽

Cart Algorithm

Classifying high-dimensional data is a challenging task in data mining. Gene expression data is a type of high-dimensional data with thousands of features. This study proposes a method to extract knowledge from high-dimensional gene expression data by selecting features and then classifying. Lasso was used to select features, and the classification and regression tree (CART) algorithm was used to construct the decision tree model. To examine the stability of the lasso decision tree, we performed bootstrap aggregating (bagging) with 50 replications. The gene expression data used was an ovarian tumor dataset with 1,545 observations, 10,935 gene features, and a binary class. The findings showed that the lasso decision tree could produce an interpretable model that was theoretically correct and had an accuracy of 89.32%. Meanwhile, the model obtained from the majority vote gave an accuracy of 90.29%, an increase of about 1 percentage point over the single lasso decision tree model. This slight increase in accuracy indicates that the lasso decision tree classifier is stable.
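The bagging step (bootstrap resampling, one model per replication, majority vote) can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: the base learner is a one-feature threshold stump standing in for the lasso-selected CART model, and the names `fit_stump`, `bagging_fit`, and `bagging_predict` are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a depth-1 tree (one feature, one threshold) by exhaustive search."""
    labels = sorted(set(y))
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for left in labels:
                for right in labels:
                    errors = sum(
                        (left if row[f] <= t else right) != label
                        for row, label in zip(X, y)
                    )
                    if best is None or errors < best[0]:
                        best = (errors, f, t, left, right)
    _, f, t, left, right = best
    return lambda row: left if row[f] <= t else right


def bagging_fit(X, y, n_replications=50, seed=0):
    """Train one base model per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_replications):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, row):
    """Return the class that receives the most votes across all models."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]
```

Comparing the majority-vote accuracy with that of a single model, as the abstract does, gives a rough indication of how sensitive the base classifier is to resampling of the training data.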

Iterated Local Least Squares Microarray Missing Value Imputation

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720006002302 ◽

2006 ◽

Vol 04 (05) ◽

pp. 935-957 ◽

Cited By ~ 51

Author(s):

Zhipeng Cai ◽

Maysam Heydari ◽

Guohui Lin

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Least Squares ◽

Gene Expression Data ◽

Missing Values ◽

Target Genes ◽

Accurate Estimation ◽

Expression Data ◽

Microarray Gene Expression ◽

Missing Value

Microarray gene expression data often contain multiple missing values for various reasons. However, most gene expression data analysis algorithms require complete expression data. Therefore, accurate estimation of the missing values is critical to further data analysis. In this paper, an Iterated Local Least Squares Imputation (ILLSimpute) method is proposed for estimating missing values. The method has two distinguishing features. First, ILLSimpute does not fix a common number of coherent genes for the target genes, but defines coherent genes as those within a distance threshold of the target gene. Second, the values estimated in one iteration are used for missing value estimation in the next iteration, and the method terminates after a set number of iterations or once the imputed values converge. Experimental results on six real microarray datasets showed that ILLSimpute performed at least as well as, and most of the time much better than, five of the most recent imputation methods.
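The two features described above, threshold-based selection of coherent genes and reuse of estimates across iterations, can be sketched in a few lines. This is a simplified illustration, not ILLSimpute itself: the local least-squares fit is replaced by a plain average over coherent genes to keep the sketch dependency-free, the row-mean initialization is an assumption, and the names `iterated_threshold_impute`, `threshold`, `max_iter`, and `tol` are hypothetical. Rows are genes and `None` marks a missing value.

```python
import math


def row_distance(a, b):
    """Root mean squared difference between two fully filled gene rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def iterated_threshold_impute(matrix, threshold, max_iter=10, tol=1e-6):
    """Impute gaps from genes within a distance threshold, then iterate."""
    missing = [(g, s) for g, row in enumerate(matrix)
               for s, v in enumerate(row) if v is None]

    # Initial fill (assumption): mean of the observed values in each row.
    current = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        fill = sum(observed) / len(observed) if observed else 0.0
        current.append([fill if v is None else v for v in row])

    for _ in range(max_iter):
        updated = [list(row) for row in current]
        for g, s in missing:
            # Coherent genes: every gene within the threshold, not a fixed k.
            coherent = [h for h in range(len(current))
                        if h != g
                        and row_distance(current[g], current[h]) <= threshold]
            if coherent:
                # Stand-in for the local least-squares estimate.
                updated[g][s] = (sum(current[h][s] for h in coherent)
                                 / len(coherent))
        change = max((abs(updated[g][s] - current[g][s]) for g, s in missing),
                     default=0.0)
        current = updated  # estimates feed the next iteration's distances
        if change < tol:
            break
    return current
```

Because distances are recomputed from the current estimates, the set of coherent genes for a target gene can change from one iteration to the next, which is the point of iterating.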
