Efficient technique of microarray missing data imputation using clustering and weighted nearest neighbour

Abstract For most bioinformatics statistical methods, particularly for gene expression data classification, prognosis, and prediction, a complete dataset is required. The gene sample value can be missing due to hardware failure, software failure, or manual mistakes. The missing data in gene expression research dramatically affects the analysis of the collected data. Consequently, this has become a critical problem that requires an efficient imputation algorithm to resolve. This paper proposes a technique considering the local similarity structure that predicts the missing data using clustering and top K nearest neighbor approaches for imputing the missing value. A similarity-based spectral clustering approach is used that is combined with K-means. The spectral clustering parameters, cluster size, and weighting factors are optimized, and after that, missing values are predicted. For imputing each cluster’s missing value, the top K nearest neighbor approach utilizes the concept of weighted distance. The evaluation is carried out on numerous datasets from a variety of biological areas, with experimentally inserted missing values varying from 5 to 25%. Experimental results show that the proposed imputation technique makes more accurate predictions than other imputation procedures. In this paper, for performing the imputation experiments, microarray gene expression datasets consisting of information on different cancers and tumors are considered. The main contribution of this research is to show that local similarity-based techniques can be used for imputation even when the dataset has varying dimensionality and characteristics.
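
The weighted nearest neighbor step described in the abstract can be illustrated with a minimal sketch. It covers only imputation inside one already-formed cluster; the spectral clustering and parameter optimization are not shown. The function name `weighted_knn_impute`, the Euclidean distance over shared columns, and the inverse-distance weights are all assumptions for illustration, not the authors' exact formulation.

```python
import math

def weighted_knn_impute(rows, k=3):
    """Fill None entries from the k nearest rows, weighted by inverse distance.

    Hypothetical helper: the distance is a root-mean-square difference over
    the columns observed in both rows, and the 1 / (d + eps) weighting is
    an assumption, not the weighting used in the paper.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours must have column j observed.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            nearest = sorted(c for c in candidates if math.isfinite(c[0]))[:k]
            if not nearest:
                continue  # leave the gap if no usable neighbour exists
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            imputed[i][j] = total / sum(weights)
    return imputed
```

Closer rows dominate the estimate, so a row that matches the target exactly on the shared columns contributes almost all of the imputed value.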

The K Nearest Neighbor Algorithm for Imputation of Missing Longitudinal Prenatal Alcohol Data

10.21203/rs.3.rs-32456/v2 ◽

2021 ◽

Author(s):

Ayesha Sania ◽

Nicolo Pini ◽

Morgan Nelson ◽

Michael Myers ◽

Lauren Shuffrey ◽

...

Keyword(s):

Missing Data ◽

Missing Values ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Drinking Behavior ◽

Nearest Neighbors ◽

First Trimester ◽

Epidemiologic Studies ◽

K Nearest Neighbor ◽

Timeline Followback

Abstract Background: Missing data are a source of bias in epidemiologic studies. This is problematic in alcohol research where data missingness is linked to drinking behavior. Methods: The Safe Passage study was a prospective investigation of prenatal drinking and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing data using a machine learning algorithm, “K Nearest Neighbor” (K-NN). K-NN imputes missing values for a participant using data of the participants closest to it. Imputed values were weighted for the distances from nearest neighbors and matched for day of week. Validation was done on randomly deleted data for 5-15 consecutive days. Results: Data from 5 nearest neighbors and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from a first trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual. Conclusions: K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.
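
The validation idea in this abstract (delete a run of 5-15 consecutive days, impute, then compare with the actual values) can be sketched as follows. The helper names `mask_segment` and `agreement` are hypothetical, and the comparison here is per day rather than per segment as reported in the abstract, which is a simplifying assumption.

```python
import random

def mask_segment(series, rng, min_len=5, max_len=15):
    """Hide one run of consecutive days; return the masked copy and the run."""
    length = rng.randint(min_len, max_len)
    start = rng.randrange(0, len(series) - length + 1)
    masked = list(series)
    for t in range(start, start + length):
        masked[t] = None
    return masked, start, length

def agreement(actual, predicted, tolerance=1.0):
    """Return (share exactly equal, share differing but within tolerance)."""
    n = len(actual)
    exact = sum(1 for a, p in zip(actual, predicted) if a == p)
    close = sum(
        1 for a, p in zip(actual, predicted)
        if a != p and abs(a - p) <= tolerance
    )
    return exact / n, close / n
```

An imputer would be run on the masked series and its output for the hidden run passed to `agreement` together with the true values.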

Application of imputation methods for missing values of PM10 and O3 data: Interpolation, moving average and K-nearest neighbor methods

Environmental Health Engineering and Management ◽

10.34172/ehem.2021.25 ◽

2021 ◽

Vol 8 (3) ◽

pp. 215-226

Author(s):

Parisa Saeipourdizaj ◽

Parvin Sarbakhsh ◽

Akbar Gholampour

Keyword(s):

Missing Data ◽

Human Error ◽

Missing Values ◽

Nearest Neighbor ◽

Moving Average ◽

Classification And Regression Tree ◽

Coefficient Of Determination ◽

K Nearest Neighbor ◽

Imputation Methods ◽

Machine Failure

Background: In air quality studies, missing data are very common due to reasons such as machine failure or human error. The approach used in dealing with such missing data can affect the results of the analysis. The main aim of this study was to review the types of missingness mechanisms and imputation methods, apply some of them to the imputation of missing PM10 and O3 data in Tabriz, and compare their efficiency. Methods: The mean, EM algorithm, regression, classification and regression tree, predictive mean matching (PMM), interpolation, moving average, and K-nearest neighbor (KNN) methods were used. PMM was investigated by considering the spatial and temporal dependencies in the model. Missing data were randomly simulated with 10, 20, and 30% missing values. The efficiency of the methods was compared using the coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE). Results: Based on the results for all indicators, interpolation, moving average, and KNN had the best performance, in that order. PMM did not perform well with or without spatio-temporal information. Conclusion: Because pollution data always depend on the next and previous observations, methods whose computation is based on before-and-after information performed better than the others, so these methods are recommended for pollutant data.
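
Two of the best-performing methods named above, interpolation and moving average, both fill a gap from the values before and after it. A minimal sketch of each, together with the MAE and RMSE error measures, is given here. The function names and the symmetric window are assumptions for illustration; the study's exact settings are not stated in the abstract.

```python
import math

def interpolate_linear(series):
    """Fill interior None gaps on a straight line between observed neighbours.

    Leading and trailing gaps are left as None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

def moving_average_fill(series, window=2):
    """Fill each None with the mean of observed values within `window` steps."""
    filled = list(series)
    for i, v in enumerate(series):
        if v is None:
            lo, hi = max(0, i - window), min(len(series), i + window + 1)
            nearby = [x for x in series[lo:hi] if x is not None]
            if nearby:
                filled[i] = sum(nearby) / len(nearby)
    return filled

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )
```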

The K nearest neighbor algorithm for imputation of missing longitudinal prenatal alcohol data

10.21203/rs.3.rs-32456/v1 ◽

2020 ◽

Author(s):

Ayesha Sania ◽

Nicolò Pini ◽

Morgan E. Nelson ◽

Michael M. Myers ◽

Lauren C. Shuffrey ◽

...

Keyword(s):

Missing Data ◽

Alcohol Consumption ◽

Missing Values ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Drinking Behavior ◽

Nearest Neighbors ◽

K Nearest Neighbor ◽

Data Set ◽

Prenatal Alcohol

Abstract Background: Missing data are a source of bias in many epidemiologic studies. This is problematic in alcohol research where data missingness may not be random, as it depends on patterns of drinking behavior. Methods: The Safe Passage Study was a prospective investigation of prenatal alcohol consumption and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing exposure data using a machine learning algorithm, “K Nearest Neighbor” (K-NN). K-NN imputes missing values for a participant using data of other participants closest to it. Since participants with no missing days may not be comparable to those with missing data, segments from those with complete and incomplete data were included as a reference. Imputed values were weighted for the distances from nearest neighbors and matched for day of week. We validated our approach by randomly deleting non-missing data for 5-15 consecutive days. Results: We found that data from 5 nearest neighbors (i.e. K=5) and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from a first trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual. Conclusions: K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.

Missing data imputation using Evolutionary k-Nearest neighbor algorithm for gene expression data

2016 Sixteenth International Conference on Advances in ICT for Emerging Regions (ICTer) ◽

10.1109/icter.2016.7829911 ◽

2016 ◽

Cited By ~ 4

Author(s):

Hiroshi de Silva ◽

A. Shehan Perera

Keyword(s):

Gene Expression ◽

Missing Data ◽

Gene Expression Data ◽

Nearest Neighbor ◽

Expression Data ◽

Data Imputation ◽

K Nearest Neighbor ◽

Nearest Neighbor Algorithm ◽

Missing Data Imputation ◽

K Nearest Neighbor Algorithm

k-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction

The Pharmacogenomics Journal ◽

10.1038/tpj.2010.56 ◽

2010 ◽

Vol 10 (4) ◽

pp. 292-309 ◽

Cited By ~ 69

Author(s):

R M Parry ◽

W Jones ◽

T H Stokes ◽

J H Phan ◽

R A Moffitt ◽

...

Keyword(s):

Gene Expression ◽

Clinical Outcome ◽

Gene Expression Analysis ◽

Outcome Prediction ◽

Nearest Neighbor ◽

K Nearest Neighbor ◽

Microarray Gene Expression ◽

Clinical Outcome Prediction ◽

Microarray Gene ◽

Microarray Gene Expression Analysis

A Survey On Missing Data in Machine Learning

10.21203/rs.3.rs-535520/v1 ◽

2021 ◽

Author(s):

Tlamelo Emmanuel ◽

Thabiso Maupong ◽

Dimane Mpoeleng ◽

Thabo Semong ◽

Mphago Banyatsang ◽

...

Keyword(s):

Machine Learning ◽

Missing Data ◽

Human Error ◽

Missing Values ◽

Nearest Neighbor ◽

Research Direction ◽

Machine Learning Techniques ◽

Future Research ◽

Learning Approaches ◽

K Nearest Neighbor

Abstract Machine learning has been the cornerstone in analysing and extracting information from data, and the problem of missing values is often encountered. Missing values occur as a result of various factors like missing completely at random, missing at random or missing not at random. All these may be the result of system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data since ignoring or omitting missing values may result in biased or misinformed analysis. In the literature there have been several proposals for handling missing values. In this paper we aggregate some of the literature on missing data, particularly focusing on machine learning techniques. We also give insight on how the machine learning approaches work by highlighting the key features of the proposed techniques, how they perform, their limitations and the kind of data they are most suitable for. Finally, we experiment on the K nearest neighbor and random forest imputation techniques on novel power plant induced fan data and offer some possible future research direction.

CHOOSING APPROPRIATE IMPUTATION METHODS FOR MISSING DATA: A DECISION ALGORITHM ON METHODS FOR MISSING DATA

Journal of Al-Qadisiyah for Computer Science and Mathematics ◽

10.29304/jqcm.2019.11.2.588 ◽

2019 ◽

Vol 11 (2) ◽

pp. 65-73

Author(s):

Wisam A. Mahmood ◽

Mohammed S. Rashid ◽

Teaba Wala Aldeen

Keyword(s):

Missing Data ◽

Simulation Study ◽

Missing Values ◽

Nearest Neighbor ◽

Support Vector ◽

K Nearest Neighbor ◽

Decision Algorithm ◽

Imputation Methods ◽

Regression Imputation ◽

Mean Imputation

Missing values are common in medical research and can introduce substantial bias when they are neglected or handled poorly. Although some standard statistical methods have already been developed and are available for dealing with this challenge, no method available so far yields credible estimates. When missing values occur in a dataset, the usable data size shrinks and efficiency decreases. A number of imputation methods for handling missing values have been proposed in earlier scholarly work. The regular methods include the complete case method, mean imputation method, Last Observation Carried Forward (LOCF) method, Expectation-Maximization (EM) algorithm, Markov Chain Monte Carlo (MCMC), Mean Imputation (Mean), Hot Deck (HOT), Regression Imputation (Regress), K-nearest neighbor (KNN), K-Mean Clustering, Fuzzy K-Mean Clustering, Support Vector Machine, and Multiple Imputation (MI) method. In the present paper, a simulation study investigates the efficacy of the above-mentioned archetypal imputation methods in a longitudinal data setting under missing completely at random (MCAR). We introduced missingness in a block for three cases: a low level of 5% and higher levels of 30% and 50%. Comparing the methods through this simulation study, we concluded that the LOCF method has more bias than the other methods in most situations.
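
To make the LOCF finding concrete, here is a minimal sketch of LOCF next to mean imputation on a single longitudinal series. The function names are hypothetical and the toy series is invented purely for illustration; it is not data from the study.

```python
def locf(series):
    """Last Observation Carried Forward: repeat the latest observed value.

    Entries before the first observation stay None.
    """
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def mean_impute(series):
    """Replace every None with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]

# Toy series with an upward trend whose last two visits are missing.
# LOCF freezes the series at the last observed value, so it cannot
# follow the trend.
trend = [1.0, 2.0, 3.0, None, None]
print(locf(trend))         # [1.0, 2.0, 3.0, 3.0, 3.0]
print(mean_impute(trend))  # [1.0, 2.0, 3.0, 2.0, 2.0]
```

On trending data both simple methods underestimate the missing tail, which is one intuitive way bias can arise when the last observation is carried forward.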

ITERATED LOCAL LEAST SQUARES MICROARRAY MISSING VALUE IMPUTATION

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720006002302 ◽

2006 ◽

Vol 04 (05) ◽

pp. 935-957 ◽

Cited By ~ 51

Author(s):

ZHIPENG CAI ◽

MAYSAM HEYDARI ◽

GUOHUI LIN

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Least Squares ◽

Gene Expression Data ◽

Missing Values ◽

Target Genes ◽

Accurate Estimation ◽

Expression Data ◽

Microarray Gene Expression ◽

Missing Value

Microarray gene expression data often contain multiple missing values due to various reasons. However, most gene expression data analysis algorithms require complete expression data. Therefore, accurate estimation of the missing values is critical to further data analysis. In this paper, an Iterated Local Least Squares Imputation (ILLSimpute) method is proposed for estimating missing values. The ILLSimpute method has two unique features. First, it does not fix a common number of coherent genes for target genes for estimation purposes, but defines coherent genes as those within a distance threshold of the target genes. Second, estimated values in one iteration are used for missing value estimation in the next iteration, and the method terminates after a certain number of iterations or when the imputed values converge. Experimental results on six real microarray datasets showed that the ILLSimpute method performed at least as well as, and most of the time much better than, five most recent imputation methods.
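
The two structural features described above, threshold-based neighbor selection and iteration on the previous round's estimates, can be sketched in a few lines. This is not ILLSimpute itself: to stay dependency-free, the sketch swaps the local least squares fit for a plain average over the coherent rows. The function name, the row-mean starting guess, and the Euclidean distance are likewise assumptions.

```python
import math

def iterated_threshold_impute(rows, threshold, max_iter=10, tol=1e-6):
    """Iteratively re-estimate None cells from rows within a distance threshold.

    Simplified stand-in: coherent rows are averaged rather than combined by
    a least squares fit.
    """
    missing = [
        (i, j) for i, r in enumerate(rows) for j, v in enumerate(r) if v is None
    ]
    # Initial guess: mean of the row's observed entries (0.0 if none observed).
    current = []
    for r in rows:
        observed = [v for v in r if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        current.append([mean if v is None else v for v in r])

    for _ in range(max_iter):
        updated = [list(r) for r in current]
        for i, j in missing:
            # Coherent rows: all other rows within the threshold, measured
            # on the current (partly imputed) matrix.
            coherent = [
                other[j]
                for m, other in enumerate(current)
                if m != i and math.dist(current[i], other) <= threshold
            ]
            if coherent:
                updated[i][j] = sum(coherent) / len(coherent)
        change = max(
            (abs(updated[i][j] - current[i][j]) for i, j in missing),
            default=0.0,
        )
        current = updated
        if change < tol:
            break  # imputed values have converged
    return current
```

Because the number of coherent rows is set by the threshold rather than fixed in advance, a target row in a dense region draws on many neighbors while an isolated one draws on few or none.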

A survey on missing data in machine learning

Journal Of Big Data ◽

10.1186/s40537-021-00516-9 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Tlamelo Emmanuel ◽

Thabiso Maupong ◽

Dimane Mpoeleng ◽

Thabo Semong ◽

Banyatsang Mphago ◽

...

Keyword(s):

Machine Learning ◽

Missing Data ◽

Human Error ◽

Missing Values ◽

Nearest Neighbor ◽

Research Direction ◽

Machine Learning Techniques ◽

Future Research ◽

Learning Approaches ◽

K Nearest Neighbor

Abstract Machine learning has been the cornerstone in analysing and extracting information from data, and the problem of missing values is often encountered. Missing values occur because of various factors like missing completely at random, missing at random or missing not at random. All these may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data since ignoring or omitting missing values may result in biased or misinformed analysis. In the literature there have been several proposals for handling missing values. In this paper, we aggregate some of the literature on missing data, particularly focusing on machine learning techniques. We also give insight on how the machine learning approaches work by highlighting the key features of missing values imputation techniques, how they perform, their limitations and the kind of data they are most suitable for. We propose and evaluate two methods, the k nearest neighbor and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris and novel power plant fan data with induced missing values at missingness rates of 5% to 20%. We show that both missForest and the k nearest neighbor can successfully handle missing values and offer some possible future research direction.
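
The evaluation protocol in this abstract, inducing missing values at a chosen rate and scoring the imputed cells against the hidden truth, can be sketched as a small harness. The names `induce_missing`, `column_mean_impute`, and `imputation_rmse` are hypothetical. The column-mean imputer is only a baseline stand-in for the k nearest neighbor and missForest methods the paper evaluates, and the masking assumes values are missing completely at random.

```python
import math
import random

def induce_missing(rows, rate, seed=0):
    """Hide a fraction `rate` of cells completely at random."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    hidden = rng.sample(cells, round(rate * len(cells)))
    masked = [list(row) for row in rows]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden

def column_mean_impute(rows):
    """Baseline stand-in imputer: replace None with the observed column mean."""
    filled = [list(row) for row in rows]
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = mean
    return filled

def imputation_rmse(rows, impute, rate, seed=0):
    """RMSE between true and imputed values over the hidden cells only."""
    masked, hidden = induce_missing(rows, rate, seed)
    if not hidden:
        return 0.0
    filled = impute(masked)
    squared = sum((filled[i][j] - rows[i][j]) ** 2 for i, j in hidden)
    return math.sqrt(squared / len(hidden))
```

Any imputer that maps a matrix with `None` gaps to a filled matrix can be passed as `impute`, so different methods can be compared at the same rate and seed.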

Optimization of Missing Value Data Imputation Automatic Dependent Surveillance Broadcasting (ADS-B) Based on K-Nearest Neighbor and Genetic Algorithm

International Journal of Computer Applications Technology and Research ◽

10.7753/ijcatr0912.1003 ◽

2020 ◽

Vol 9 (12) ◽

pp. 327-331

Author(s):

Didik Hariyanto ◽

Sholeh Hadi Pramono ◽

Erni Yudaningtyas

Keyword(s):

Genetic Algorithm ◽

Monte Carlo ◽

Technology Use ◽

Missing Values ◽

Nearest Neighbor ◽

Data Imputation ◽

Flow Data ◽

K Nearest Neighbor ◽

Missing Value ◽

K Value

Flight navigation equipment still relies on conventional technology, namely radar, but is slowly starting to switch to Automatic Dependent Surveillance-Broadcast (ADS-B) [6]. This study uses RTL-SDR to detect aircraft and carries out tests through the Monte Carlo method on altitude, latitude, and longitude only [3]. However, this system has a problem with missing values in the preprocessed ADS-B flow data. The KNN method is the most popular for handling missing values, but a weakness in the KNN method can reduce its performance [9]. A Genetic Algorithm (GA) is therefore proposed to optimize the k value in the KNN method. The study obtained a better MSE value in the imputation process: for Altitude, k = 3 with MSE = 128668.96; for Speed, k = 6 with MSE = 457.5201; for Heading, k = 61 with MSE = 752.1429; for Latitude, k = 3 with MSE = 9.16E-05; and for Longitude, k = 2 with MSE = 1.68E-05.
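
The idea of letting a genetic algorithm pick k can be illustrated with a minimal sketch. Everything here is an assumption: the function name `ga_select_k`, the search range, the population size, and the selection, crossover, and mutation operators are illustrative choices rather than the authors' configuration. The `fitness` argument stands for any function that returns the imputation MSE for a given k (lower is better).

```python
import random

def ga_select_k(fitness, k_min=1, k_max=61, pop_size=8, generations=20, seed=0):
    """Search integer k values with a tiny genetic algorithm.

    Returns the best k found and the best fitness seen in each generation.
    """
    rng = random.Random(seed)
    population = [rng.randint(k_min, k_max) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]   # truncation selection
        children = [ranked[0]]              # elitism: keep the best k unchanged
        while len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            child = (a + b) // 2            # arithmetic crossover on integers
            if rng.random() < 0.3:          # mutation: small random step
                child += rng.randint(-3, 3)
            children.append(min(k_max, max(k_min, child)))
        population = children
    best = min(population, key=fitness)
    return best, history
```

Because the best individual is always carried over, the best fitness per generation never gets worse. In practice `fitness` would run KNN imputation with the given k on held-out values and return the resulting MSE, computed separately for each variable.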
