Missing Value Estimation Methods Research for Arrhythmia Classification Using the Modified Kernel Difference-Weighted KNN Algorithms

2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Fei Yang ◽  
Jiazhi Du ◽  
Jiying Lang ◽  
Weigang Lu ◽  
Lei Liu ◽  
...  

Electrocardiogram (ECG) signals are critical to the classification of cardiac arrhythmia with machine learning methods. In practice, ECG datasets usually contain multiple missing values due to faults or distortion. Unfortunately, many established classification algorithms require a complete matrix as input. It is therefore necessary to impute the missing data to improve classification on datasets with a few missing values. In this paper, we compare the main methods for estimating missing values in electrocardiogram data, namely the “Zero method”, “Mean method”, “PCA-based method”, and “RPCA-based method”, and then propose a novel KNN-based classification algorithm, a modified kernel Difference-Weighted KNN classifier (MKDF-WKNN), which is suited to the classification of imbalanced datasets. The experimental results on the UCI database indicate that the “RPCA-based method” can successfully handle missing values in the arrhythmia dataset regardless of how many values are missing, and that our proposed classification algorithm, MKDF-WKNN, is superior to other state-of-the-art algorithms such as KNN, DS-WKNN, DF-WKNN, and KDF-WKNN on imbalanced datasets, where class imbalance affects classification accuracy.
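The two simplest baselines named in this abstract, the “Zero method” and the “Mean method”, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes missing entries are encoded as `None`, that the mean is taken per column over observed values, and the names `impute_zero`, `impute_mean`, and `toy` are hypothetical. The PCA-based and RPCA-based methods are not shown.

```python
from typing import List, Optional

Matrix = List[List[Optional[float]]]


def impute_zero(rows: Matrix) -> List[List[float]]:
    """Replace every missing entry (None) with 0.0."""
    return [[0.0 if v is None else v for v in row] for row in rows]


def impute_mean(rows: Matrix) -> List[List[float]]:
    """Replace every missing entry with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: fall back to 0.0 when a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


# Hypothetical toy matrix: rows are records, columns are features.
toy: Matrix = [
    [1.0, None, 3.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 9.0],
]

zero_filled = impute_zero(toy)
mean_filled = impute_mean(toy)
```

Either result is a complete matrix that a classifier requiring fully observed input can consume; the abstract's comparison concerns how much each choice distorts the data.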

2013 ◽  
Vol 2013 ◽  
pp. 1-5 ◽  
Author(s):  
Fuxi Shi ◽  
Dan Zhang ◽  
Jun Chen ◽  
Hamid Reza Karimi

Missing values are prevalent in microarray data. They negatively influence downstream microarray analyses and should therefore be estimated from known values. We propose a BPCA-iLLS method, which integrates two commonly used missing value estimation methods: Bayesian principal component analysis (BPCA) and local least squares (LLS). The inferior row-average procedure in LLS is replaced with BPCA, and the least squares method is put into an iterative framework. Comparative results show that the proposed method obtains the highest estimation accuracy across all missing rates on different types of testing datasets.


2017 ◽  
Vol 46 (02) ◽  
pp. 317-326 ◽  
Author(s):  
ADILAH ABDUL GHAPOR ◽  
YONG ZULINA ZUBAIRI ◽  
RAHMATULLAH IMON A.H.M.

2001 ◽  
Vol 17 (6) ◽  
pp. 520-525 ◽  
Author(s):  
O. Troyanskaya ◽  
M. Cantor ◽  
G. Sherlock ◽  
P. Brown ◽  
T. Hastie ◽  
...  

2016 ◽  
Vol 2016 ◽  
pp. 1-19 ◽  
Author(s):  
Sujay Saha ◽  
Anupam Ghosh ◽  
Dibyendu Bikash Seal ◽  
Kashi Nath Dey

Most gene expression data analysis algorithms require the entire gene expression matrix without any missing values. Hence, it is necessary to devise methods that impute missing data values accurately. A number of imputation algorithms exist to estimate those missing values. This work starts with a microarray dataset containing multiple missing values. We first apply a modified version of the existing fuzzy-theory-based method LRFDVImpute to impute multiple missing values of time series gene expression data and then validate the imputation result with a genetic algorithm (GA) based gene ranking methodology along with some regular statistical validation techniques, such as RMSE. To the best of our knowledge, gene ranking has not yet been used to validate the result of missing value estimation. First, the proposed method was tested on the widely used Spellman dataset, and the results show that error margins are drastically reduced compared to some previous works, which indirectly validates the statistical significance of the proposed method. It was then applied to four other 2-class benchmark datasets, namely the Colorectal Cancer tumours dataset (GDS4382), Breast Cancer dataset (GSE349-350), Prostate Cancer dataset, and DLBCL-FL (Leukaemia), for both missing value estimation and gene ranking, and the results show that the proposed method can reach 100% classification accuracy with very few dominant genes, which indirectly validates the biological significance of the proposed method.
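The RMSE validation mentioned in this abstract is commonly computed by hiding known entries, imputing them, and comparing the estimates with the hidden truth. The sketch below assumes that protocol; the function name `rmse_at_masked` and the toy matrices are hypothetical, and the abstract does not state whether the authors normalise the error.

```python
import math
from typing import List, Sequence, Tuple


def rmse_at_masked(
    truth: Sequence[Sequence[float]],
    imputed: Sequence[Sequence[float]],
    masked: Sequence[Tuple[int, int]],
) -> float:
    """Root-mean-square error over the (row, column) positions that were hidden."""
    if not masked:
        raise ValueError("at least one masked position is required")
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in masked]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical example: two entries were hidden and then re-estimated.
truth: List[List[float]] = [[1.0, 2.0], [3.0, 4.0]]
imputed: List[List[float]] = [[1.0, 2.5], [3.0, 3.5]]
masked = [(0, 1), (1, 1)]

error = rmse_at_masked(truth, imputed, masked)
```

Only the masked positions enter the average, so entries that were never missing do not dilute the error.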


2007 ◽  
Vol 05 (05) ◽  
pp. 1005-1022 ◽  
Author(s):  
ELENA TSIPORKOVA ◽  
VESELKA BOEVA

Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance to measure the similarity between time expression profiles, and subsequently selects, for each gene expression profile with missing values, a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms the neighborhood-wise and the position-wise algorithms, in particular for datasets with a higher level of missing entries. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former proved superior to the latter. Motivated by these findings, which clearly indicate the added value of DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides a choice between three different initial rough imputation methods.
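The DTW distance that this abstract uses to compare expression profiles can be sketched with the textbook dynamic-programming recurrence. This is an assumption-laden illustration rather than the DTWimpute code: it uses absolute difference as the local cost, applies no warping-window constraint, and expects complete profiles (the abstract's rough initial imputation step would come first). The name `dtw_distance` is hypothetical.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW: minimum cumulative |a[i] - b[j]| over all monotone alignments."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best cumulative cost aligning a[:i] with b[:j].
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# A profile and a time-stretched copy of it align perfectly under DTW.
stretched = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
# Two flat profiles offset by 1 cost 1 per aligned pair along the diagonal.
offset = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

Because DTW tolerates shifts and stretches along the time axis, profiles with similar shape but different timing can still be selected as candidates for estimation, which a pointwise distance would miss.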


2020 ◽  
Vol 4 (2) ◽  
pp. 15-27
Author(s):  
Recep Sinan ARSLAN ◽  
Ahmet Haşim Yurttakal

ABSTRACT The Android application platform is progressing rapidly. This growth has made it a target of malicious application developers, leading to an increase in the number of malware apps, greater diversity in techniques, and more damage. It is therefore critical to detect such software and improve the security of mobile users. Static and dynamic analysis, behaviour scrutiny, and machine learning methods are used to ensure security. In this study, the K-Nearest Neighbour (KNN) classifier, one of the machine learning methods, is used with the aim of detecting malicious mobile software successfully and quickly. The tests were conducted on a dataset that includes 492 malware and 697 benign applications. In the proposed algorithm, the number of neighbours is 5 and the distance metric is Minkowski. A randomly selected 80% of the dataset is reserved for training and 20% for testing. As a result, 94.1% accuracy, 91.2% precision, 92.7% recall, and a 92.4% f1-measure are achieved. The high f1-measure shows that the proposed model is successful in detecting both malware and benign software. The success of the KNN algorithm in classifying malicious Android apps has been demonstrated.
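The classifier settings and metrics reported in this abstract can be sketched as below. This is a minimal illustration under stated assumptions: the abstract gives 5 neighbours and the Minkowski metric but not its order, so `p=2.0` (Euclidean) is assumed; the two-feature toy data, the label convention (1 for malware, 0 for benign), and the function names are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def minkowski(a: Sequence[float], b: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance of order p (p=2 is Euclidean, p=1 is Manhattan)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[int],
    query: Sequence[float],
    k: int = 5,
    p: float = 2.0,  # assumption: the order is not given in the abstract
) -> int:
    """Majority vote among the k training points nearest to the query."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: minkowski(pair[0], query, p))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the chosen positive class."""
    tp = sum(1 for t, y in zip(y_true, y_pred) if t == positive and y == positive)
    fp = sum(1 for t, y in zip(y_true, y_pred) if t != positive and y == positive)
    fn = sum(1 for t, y in zip(y_true, y_pred) if t == positive and y != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data: label 0 = benign, label 1 = malware.
train_x: List[List[float]] = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y: List[int] = [0, 0, 0, 1, 1, 1]

near_benign = knn_predict(train_x, train_y, [0.2, 0.2])
near_malware = knn_predict(train_x, train_y, [5.5, 5.5])
metrics = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

F1 is the harmonic mean of precision and recall, so it only stays high when both are high, which is why the abstract reads it as evidence that both classes are detected well.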

