A COMPARISON OF CLUSTERING BY IMPUTATION AND SPECIAL CLUSTERING ALGORITHMS ON THE REAL INCOMPLETE DATA

2020 ◽  
Vol 13 (2) ◽  
pp. 65-75
Author(s):  
Ridho Ananda ◽  
Atika Ratna Dewi ◽  
Nurlaili Nurlaili

Missing values seriously hinder clustering. Researchers have proposed several ways to overcome the problem, two of which are imputation and special clustering algorithms. This paper compares the clustering results obtained with both approaches on incomplete data. The k-means algorithm was applied to the imputed data. The algorithms used were distribution free multiple imputation (DFMI), Gabriel eigen (GE), expectation maximization-singular value decomposition (EM-SVD), biplot imputation (BI), four modified fuzzy c-means (FCM) algorithms, k-means soft constraints (KSC), distance estimation strategy fuzzy c-means (DESFCM), and k-means soft constraints imputed-observed (KSC-IO). The data used were the 2018 environmental performance index (EPI) and simulated data. The optimal clustering of the 2018 EPI data was chosen based on the Silhouette index, whose capability had first been tested on the simulated dataset. The results showed that the Silhouette index is capable of validating clustering results on incomplete datasets, and that the optimal clustering of the 2018 EPI dataset was obtained by k-means using BI, where the silhouette index and time complexity were 0.613 and 0.063, respectively. Based on these results, k-means using BI is suggested for clustering analysis of the 2018 EPI dataset.
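The impute-then-cluster pipeline described in this abstract can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it substitutes simple column-mean imputation for the paper's imputation methods (DFMI, GE, EM-SVD, BI), uses a plain Lloyd-style k-means, and runs on made-up toy data; the function names are hypothetical and are not taken from the paper.

```python
import math
import random

def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column.

    Assumption: mean imputation is a stand-in, not one of the paper's methods.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette width over all points (singleton clusters score 0)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)  # mean intra-cluster distance
        b = min(  # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy incomplete data (hypothetical); None marks a missing value.
data = [[1.0, 2.0], [1.2, None], [0.9, 2.1], [8.0, 9.0], [None, 9.2], [8.3, 8.8]]
completed = impute_column_means(data)
labels = kmeans(completed, k=2)
score = silhouette(completed, labels)
print(labels, round(score, 3))
```

Comparing `score` across different imputation methods on the same data is one way to pick among them, which is the role the Silhouette index plays in the abstract.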

Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 786
Author(s):  
Yenny Villuendas-Rey ◽  
Eley Barroso-Cubas ◽  
Oscar Camacho-Nieto ◽  
Cornelio Yáñez-Márquez

Swarm intelligence has emerged as an active field for solving numerous machine-learning tasks. In this paper, we address the problem of clustering data with missing values, where the patterns are described by mixed (or hybrid) features. We introduce a generic modification to three swarm intelligence algorithms (Artificial Bee Colony, Firefly Algorithm, and Novel Bat Algorithm). We experimentally obtain adequate parameter values for these three modified algorithms, with the purpose of applying them to the clustering task. We also provide an unbiased comparison among several metaheuristic-based clustering algorithms, concluding that the clusters obtained by our proposals are highly representative of the “natural structure” of data.


2021 ◽  
Vol 61 (2) ◽  
pp. 364-377
Author(s):  
Rustam ◽  
Koredianto Usman ◽  
Mudyawati Kamaruddin ◽  
Dina Chamidah ◽  
Nopendri ◽  
...  

The possibilistic fuzzy c-means (PFCM) algorithm is a reliable algorithm proposed to deal with the weaknesses of fuzzy c-means (FCM) and possibilistic c-means (PCM), namely noise sensitivity and coincident clusters. However, the PFCM algorithm is only applicable to complete data sets. Therefore, this research modified the PFCM for clustering incomplete data sets, yielding OCSPFCM and NPSPFCM, and evaluated their performance on three aspects: 1) accuracy percentage, 2) the number of iterations, and 3) centroid errors. The results showed that the NPSPFCM outperforms the OCSPFCM with missing values ranging from 5% to 30% for all experimental data sets. Furthermore, the two algorithms achieve average accuracies of 97.75%−78.98% and 98.86%−92.49%, respectively.


2016 ◽  
Vol 2016 ◽  
pp. 1-8 ◽  
Author(s):  
Jinhua Li ◽  
Shiji Song ◽  
Yuli Zhang ◽  
Zhen Zhou

Incomplete data with missing feature values are prevalent in clustering problems. Traditional clustering methods first estimate the missing values by imputation and then apply the classical clustering algorithms for complete data, such as K-median and K-means. However, in practice, it is often hard to obtain accurate estimation of the missing values, which deteriorates the performance of clustering. To enhance the robustness of clustering algorithms, this paper represents the missing values by interval data and introduces the concept of robust cluster objective function. A minimax robust optimization (RO) formulation is presented to provide clustering results, which are insensitive to estimation errors. To solve the proposed RO problem, we propose robust K-median and K-means clustering algorithms with low time and space complexity. Comparisons and analysis of experimental results on both artificially generated and real-world incomplete data sets validate the robustness and effectiveness of the proposed algorithms.
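The interval idea can be illustrated with a small worst-case distance computation. This is a minimal sketch, not the paper's robust K-median or K-means algorithm: it assumes each feature is given as an interval `(lo, hi)`, with observed features as degenerate intervals, and it only shows the minimax assignment step. The function names and the toy numbers are hypothetical.

```python
def worst_case_sq_dist(intervals, center):
    """Largest squared Euclidean distance from `center` to any point in the box.

    Per dimension, the farthest point of [lo, hi] from c is one of the two
    endpoints, so the maximum is max((c - lo)^2, (c - hi)^2).
    """
    total = 0.0
    for (lo, hi), c in zip(intervals, center):
        total += max((c - lo) ** 2, (c - hi) ** 2)
    return total

def robust_assign(intervals, centers):
    """Index of the center that minimizes the worst-case squared distance."""
    return min(range(len(centers)),
               key=lambda k: worst_case_sq_dist(intervals, centers[k]))

# The first feature is observed (lo == hi); the second is missing and is
# bounded by an assumed interval [2.0, 6.0].
point = [(1.0, 1.0), (2.0, 6.0)]
centers = [[1.0, 3.0], [1.0, 7.0]]
print(robust_assign(point, centers))  # → 0 (worst cases are 9.0 vs 25.0)
```

Because the assignment depends only on the interval endpoints, it does not change when the point estimate of the missing value moves inside the interval, which is the kind of insensitivity to estimation error the abstract describes.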


2016 ◽  
Vol 26 (4) ◽  
pp. 871-884 ◽  
Author(s):  
Loai Abdallah ◽  
Ilan Shimshoni

Missing values in data are common in real-world applications. There are several methods that deal with this problem. In this paper we present lookahead selective sampling (LSS) algorithms for datasets with missing values. We developed two versions of selective sampling. The first one integrates a distance function that can measure the similarity between pairs of incomplete points within the framework of the LSS algorithm. The second algorithm uses ensemble clustering in order to represent the data in a cluster matrix without missing values and then runs the LSS algorithm based on the ensemble clustering instance space (LSS-EC). To construct the cluster matrix, we use the k-means and mean shift clustering algorithms, specially modified to deal with incomplete datasets. We tested our algorithms on six standard numerical datasets from different fields. On these datasets we simulated missing values and compared the performance of the LSS and LSS-EC algorithms for incomplete data to two other basic methods. Our experiments show that the suggested selective sampling algorithms outperform the other methods.


2021 ◽  
pp. 1-13
Author(s):  
Sonia Goel ◽  
Meena Tushir

A semi-supervised clustering technique partitions unlabeled data based on prior knowledge from labeled data. Most semi-supervised clustering algorithms exist only for complete data, i.e., data sets with no missing features. In this paper, an effort has been made to check the effectiveness of semi-supervised clustering when applied to incomplete data sets. The novelty of this approach is that it considers the missing features along with the available knowledge (labels) of the data set. A linear interpolation imputation technique initially imputes the missing features, thus completing the data set. Semi-supervised clustering is then applied to this completed data set, and the missing features are regularly updated within the clustering process. In the proposed work, the labeled portion is 30, 40, 50, or 60% of the total data. The data are further altered by arbitrarily eliminating certain features of their components, which makes the data incomplete with partial labeling. The proposed algorithm utilizes both labeled and unlabeled data, along with certain missing values in the data. It is evaluated using three performance indices, namely the misclassification rate, random index metric, and error rate. Despite the additional missing features, the proposed algorithm has been successfully implemented on real data sets and showed better or competitive results compared with well-known standard semi-supervised clustering methods.
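The initial imputation step can be sketched as follows. The abstract does not say along which axis the interpolation runs or how gaps at the ends are handled, so this sketch assumes interpolation along the sample order within a single feature column and copies the nearest observed value at the edges; `interpolate_missing` is a hypothetical name, not the authors' code.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between observed neighbors.

    Assumption: leading and trailing gaps copy the nearest observed value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one observed value is required")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:        # gap at the start of the column
            filled[i] = values[right]
        elif right is None:     # gap at the end of the column
            filled[i] = values[left]
        else:                   # interior gap: interpolate linearly
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled

column = [None, 2.0, None, 6.0, None]
completed_column = interpolate_missing(column)
print(completed_column)  # → [2.0, 2.0, 4.0, 6.0, 6.0]
```

In the approach the abstract describes, values filled in this way serve only as a starting point, since the missing features are updated again during clustering.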


2021 ◽  
Vol 3 (6) ◽  
Author(s):  
R. Sekhar ◽  
K. Sasirekha ◽  
P. S. Raja ◽  
K. Thangavel

Intrusion Detection Systems (IDSs) have received increasing attention for safeguarding the vital information in an organization's network. Hackers can often enter a secured network through loopholes and smart attacks. In such situations, distinguishing attacks from normal packets is tedious, challenging, time consuming, and highly technical. As a result, different algorithms with varying learning and training capacity have been explored in the literature. However, the existing intrusion detection methods could not meet the desired performance requirements. Hence, this work proposes a new intrusion detection technique using a Deep Autoencoder with Fruitfly Optimization. Initially, missing values in the dataset are imputed with the Fuzzy C-Means Rough Parameter (FCMRP) algorithm, which handles the imprecision in datasets by exploiting fuzzy and rough sets while preserving crucial information. Then, robust features are extracted by an Autoencoder with multiple hidden layers. Finally, the obtained features are fed to a Back Propagation Neural Network (BPN) to classify the attacks. Furthermore, the neurons in the hidden layers of the Deep Autoencoder are optimized with the population-based Fruitfly Optimization algorithm. Experiments were conducted on the NSL_KDD and UNSW-NB15 datasets. The computational results of the proposed intrusion detection system using deep autoencoder with BPN are compared with Naive Bayes, Support Vector Machine (SVM), Radial Basis Function Network (RBFN), BPN, and Autoencoder with Softmax.

Article Highlights: A hybridized model using Deep Autoencoder with Fruitfly Optimization is introduced to classify the attacks. Missing values have been imputed with the Fuzzy C-Means Rough Parameter method. The discriminative features are extracted using Deep Autoencoder with more hidden layers.


1997 ◽  
Vol 08 (03) ◽  
pp. 301-315 ◽  
Author(s):  
Marcel J. Nijman ◽  
Hilbert J. Kappen

A Radial Basis Boltzmann Machine (RBBM) is a specialized Boltzmann Machine architecture that combines feed-forward mapping with probability estimation in the input space, and for which very efficient learning rules exist. The hidden representation of the network displays symmetry breaking as a function of the noise in the dynamics. Thus, generalization can be studied as a function of the noise in the neuron dynamics instead of as a function of the number of hidden units. We show that the RBBM can be seen as an elegant alternative to k-nearest neighbor, leading to comparable performance without the need to store all data. We show that the RBBM has good classification performance compared to the MLP. The main advantage of the RBBM is that, simultaneously with the input-output mapping, a model of the input space is obtained which can be used for learning with missing values. We derive learning rules for the case of incomplete data, and show that they perform better on incomplete data than the traditional learning rules on a 'repaired' data set.

