Imputing Missing Values for Mixed Numeric and Categorical Attributes Based on Incomplete Data Hierarchical Clustering

Swarm intelligence has emerged as an active field for solving numerous machine-learning tasks. In this paper, we address the problem of clustering data with missing values, where the patterns are described by mixed (or hybrid) features. We introduce a generic modification to three swarm intelligence algorithms (Artificial Bee Colony, Firefly Algorithm, and Novel Bat Algorithm). We experimentally determine adequate parameter values for these three modified algorithms, with the purpose of applying them to the clustering task. We also provide an unbiased comparison among several metaheuristic-based clustering algorithms, concluding that the clusters obtained by our proposals are highly representative of the “natural structure” of the data.


Symmetry Breaking and Training from Incomplete Data with Radial Basis Boltzmann Machines

International Journal of Neural Systems ◽

10.1142/s0129065797000318 ◽

1997 ◽

Vol 08 (03) ◽

pp. 301-315 ◽

Cited By ~ 8

Author(s):

Marcel J. Nijman ◽

Hilbert J. Kappen

Keyword(s):

Symmetry Breaking ◽

Incomplete Data ◽

Missing Values ◽

Nearest Neighbor ◽

Boltzmann Machine ◽

K Nearest Neighbor ◽

Data Set ◽

Input Space ◽

Learning Rules ◽

Radial Basis

A Radial Basis Boltzmann Machine (RBBM) is a specialized Boltzmann Machine architecture that combines feed-forward mapping with probability estimation in the input space, and for which very efficient learning rules exist. The hidden representation of the network displays symmetry breaking as a function of the noise in the dynamics. Thus, generalization can be studied as a function of the noise in the neuron dynamics instead of as a function of the number of hidden units. We show that the RBBM can be seen as an elegant alternative to k-nearest neighbor, leading to comparable performance without the need to store all data. We show that the RBBM has good classification performance compared to the MLP. The main advantage of the RBBM is that, simultaneously with the input-output mapping, a model of the input space is obtained which can be used for learning with missing values. We derive learning rules for the case of incomplete data, and show that they perform better on incomplete data than the traditional learning rules on a 'repaired' data set.


Evolutionary Machine Learning for Classification with Incomplete Data

10.26686/wgtn.17072123 ◽

2021 ◽

Author(s):

Cao Truong Tran

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Genetic Programming ◽

Incomplete Data ◽

Missing Values ◽

Machine Learning Techniques ◽

Feature Construction ◽

Classification Algorithms ◽

Learning Techniques ◽

Effectiveness And Efficiency

Classification is a major task in machine learning and data mining. Many real-world datasets suffer from the unavoidable issue of missing values. Classification with incomplete data has to be handled carefully because inadequate treatment of missing values causes large classification errors. Most existing research on classification with incomplete data has focused on improving effectiveness, but has not adequately addressed the efficiency of applying the classifiers to unseen instances, which is much more important than the act of creating classifiers. A common approach to classification with incomplete data is to use imputation methods to replace missing values with plausible values before building classifiers and classifying unseen instances. This approach provides complete data which can then be used by any classification algorithm, but sophisticated imputation methods are usually computationally intensive, especially in the application phase of classification. Another approach is to build a classifier that can work directly with missing values. This approach does not require time for estimating missing values, but it often generates inaccurate and complex classifiers when faced with numerous missing values. A recent approach, which also avoids estimating missing values, is to build a set of classifiers from which applicable classifiers are then selected for classifying unseen instances. However, this approach is also often inaccurate and takes a long time to find applicable classifiers when faced with numerous missing values. The overall goal of the thesis is to simultaneously improve the effectiveness and efficiency of classification with incomplete data by using evolutionary machine learning techniques for feature selection, clustering, ensemble learning, feature construction and constructing classifiers.
The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data. The thesis develops wrapper-based feature selection methods to improve the input space for classification algorithms that are able to work directly with incomplete data. The methods not only improve the classification accuracy, but also reduce the complexity of classifiers able to work directly with incomplete data. The thesis develops a feature construction method to improve the input space for classification algorithms with incomplete data by proposing interval genetic programming, that is, genetic programming with a set of interval functions. The method improves the classification accuracy and reduces the complexity of classifiers. The thesis develops an ensemble approach to classification with incomplete data by integrating imputation, feature selection, and ensemble learning. The results show that the approach is more accurate and faster than previous common methods for classification with incomplete data. The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated by the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data. In summary, the thesis develops a range of approaches for simultaneously improving the effectiveness and efficiency of classification with incomplete data by using a range of evolutionary machine learning techniques.
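The "impute, then classify" approach described in this abstract can be illustrated with the simplest possible imputation method. The sketch below is an illustration only and is not one of the thesis's methods: it assumes numeric features, marks missing entries with NaN, and fills each gap with the mean of the observed values in the same column. The function name `mean_impute` is hypothetical.

```python
import math

def mean_impute(rows):
    """Replace NaN entries with the mean of the observed values in that column.

    Illustrative baseline only; a column with no observed values stays NaN.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        seen = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(sum(seen) / len(seen) if seen else math.nan)
    # Build new rows so the caller's data is left untouched.
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows]

nan = float("nan")
data = [[1.0, nan], [3.0, 4.0], [nan, 8.0]]
print(mean_impute(data))  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

The completed rows can then be passed to any classifier. The cost the abstract points to comes from more sophisticated imputers, which repeat a comparable step with far heavier computation for every unseen instance.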


Clustering Algorithm for Incomplete Data Sets with Mixed Numeric and Categorical Attributes

International Journal of Database Theory and Application ◽

10.14257/ijdta.2013.6.5.09 ◽

2013 ◽

Vol 6 (5) ◽

pp. 95-104 ◽

Cited By ~ 4

Author(s):

Wu Sen ◽

Chen Hong ◽

Feng Xiaodong

Keyword(s):

Incomplete Data ◽

Clustering Algorithm ◽

Data Sets ◽

Categorical Attributes


A Comparative Study Based on Rough Set and Classification Via Clustering Approaches to Handle Incomplete Data to Predict Learning Styles

International Journal of Decision Support System Technology ◽

10.4018/ijdsst.2017040101 ◽

2017 ◽

Vol 9 (2) ◽

pp. 1-20 ◽

Cited By ~ 2

Author(s):

Hemant Rana ◽

Manohar Lal

Keyword(s):

Learning Styles ◽

Rough Set ◽

Incomplete Data ◽

Missing Values ◽

Rough Set Theory ◽

Decision Rules ◽

Data Mining Tool ◽

Knowledge Analysis ◽

Clustering Approach ◽

Mining Tool

Handling missing attribute values is a major challenge in data analysis. There are some well-known approaches to this type of problem, including Rough Set Theory (RST) and classification via clustering. In the work reported here, RSES (Rough Set Exploration System), one of the tools based on the RST approach, and WEKA (Waikato Environment for Knowledge Analysis), a data mining tool based on classification via clustering, are used for predicting learning styles from given data, which possibly has missing values. The results of the experiments using the tools show that the problem of missing attribute values is better handled by the RST approach than by the classification via clustering approach. Further, with respect to missing values, RSES yields better decision rules if the missing values are simply ignored than if some values are assigned in place of the missing attribute values.


Comparison of Algorithms for Clustering Incomplete Data

Foundations of Computing and Decision Sciences ◽

10.2478/fcds-2014-0007 ◽

2014 ◽

Vol 39 (2) ◽

pp. 107-127 ◽

Cited By ~ 6

Author(s):

Artur Matyja ◽

Krzysztof Siminski

Keyword(s):

Data Analysis ◽

Incomplete Data ◽

Missing Values ◽

Real Data ◽

Complete Data ◽

The Other ◽

Data Sets ◽

Missing Value ◽

Comparison Of Algorithms ◽

New Algorithms

Missing values are not uncommon in real data sets. The algorithms and methods used for the analysis of complete data sets cannot always be applied to data with missing values. In order to use the existing methods for complete data, data sets with missing values are preprocessed. The other solution to this problem is the creation of new algorithms dedicated to data sets with missing values. The objective of our research is to compare the preprocessing techniques and specialised algorithms and to find their most advantageous usage.


CLUSTERING INCOMPLETE SPECTRAL DATA WITH ROBUST METHODS

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-3-w3-13-2017 ◽

2017 ◽

Vol XLII-3/W3 ◽

pp. 13-17

Author(s):

S. Äyrämö ◽

I. Pölönen ◽

M. A. Eskelinen

Keyword(s):

Incomplete Data ◽

Missing Values ◽

Robust Statistics ◽

Hyperspectral Image ◽

Underlying Structure ◽

Clustering Methods ◽

Full Data ◽

Robust Clustering ◽

Data Cubes ◽

Classical Statistics

Missing value imputation is a common approach for preprocessing incomplete data sets. In the case of data clustering, imputation methods may cause unexpected bias because they may change the underlying structure of the data. In order to avoid prior imputation of missing values, the computational operations must be projected onto the available data values. In this paper, we apply a robust nan-K-spatmed algorithm to the clustering problem on hyperspectral image data. Robust statistics, such as multivariate medians, are less sensitive to outliers than classical statistics relying on Gaussian assumptions. They are, however, computationally less tractable due to the lack of closed-form solutions. We will compare robust clustering methods on data cubes with incomplete bands to standard K-means on full data cubes.
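Projecting a computation onto the available data values can be illustrated with a distance function that skips missing coordinates. The sketch below is not the nan-K-spatmed algorithm from the paper. It assumes missing entries are stored as NaN and that a Euclidean distance is wanted, and the rescaling by the fraction of observed coordinates is one common convention rather than something the abstract specifies. The name `partial_distance` is hypothetical.

```python
import math

def partial_distance(x, y):
    """Euclidean distance over the coordinates observed in both x and y.

    The squared sum is rescaled by n / n_observed so that vectors with
    different numbers of gaps stay roughly comparable.
    """
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        return math.nan  # no shared coordinates: distance is undefined
    squared = sum((a - b) ** 2 for a, b in pairs)
    return math.sqrt(squared * len(x) / len(pairs))

print(partial_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0 (no gaps: plain Euclidean)
```

A clustering loop built on such a distance never needs a filled-in value, so the structure of the observed data is not altered by an imputation step.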


Automatic missing value imputation for cleaning phase of diabetic’s readmission prediction model

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v12i2.pp2001-2013 ◽

2022 ◽

Vol 12 (2) ◽

pp. 2001

Author(s):

Jesmeen Mohd Zebaral Hoque ◽

Jakir Hossen ◽

Shohel Sayeed ◽

Chy. Mohammed Tawsif K. ◽

Jaya Ganesan ◽

...

Keyword(s):

Incomplete Data ◽

Missing Values ◽

Prediction Models ◽

Low Cost ◽

Support Vector ◽

Data Sampling ◽

Data Set ◽

Missing Value ◽

Missing Value Imputation ◽

Proper Analysis

Recently, the healthcare industry has started generating large volumes of data. If hospitals can make use of this data, they could more easily predict outcomes and provide better treatment at early stages and at low cost. Here, data analytics (DA) was used to make correct decisions through proper analysis and prediction. However, inappropriate data may lead to flawed analysis and thus yield unacceptable conclusions. Hence, transforming the improper data in the data set into useful data is essential. Machine learning (ML) techniques were used to overcome the issues caused by incomplete data. A new architecture, automatic missing value imputation (AMVI), was developed to predict missing values in the dataset, including data sampling and feature selection. Four well-known classification models (logistic regression, support vector machine (SVM), AdaBoost, and random forest) were selected as prediction models. The performance of the complete AMVI architecture was evaluated using a structured data set obtained from the UCI repository. An accuracy of around 90% was achieved. Cross-validation also confirmed that the trained ML model is suitable and not over-fitted. The trained model is developed from the dataset and does not depend on a specific environment. It will train and obtain the best-performing model depending on the data available.


A COMPARISON OF CLUSTERING BY IMPUTATION AND SPECIAL CLUSTERING ALGORITHMS ON THE REAL INCOMPLETE DATA

Jurnal Ilmu Komputer dan Informasi ◽

10.21609/jiki.v13i2.818 ◽

2020 ◽

Vol 13 (2) ◽

pp. 65-75

Author(s):

Ridho Ananda ◽

Atika Ratna Dewi ◽

Nurlaili Nurlaili

Keyword(s):

Expectation Maximization ◽

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Distance Estimation ◽

Soft Constraints ◽

Fuzzy C Means ◽

Environmental Performance Index ◽

Silhouette Index ◽

Value Decomposition

The existence of missing values inhibits the clustering process. To overcome this, researchers have proposed several solutions, two of which are imputation and special clustering algorithms. This paper compares the clustering results obtained with both on incomplete data. The k-means algorithm was applied to the imputed data. The algorithms used were distribution free multiple imputation (DFMI), Gabriel eigen (GE), expectation maximization-singular value decomposition (EM-SVD), biplot imputation (BI), four algorithms of modified fuzzy c-means (FCM), k-means soft constraints (KSC), distance estimation strategy fuzzy c-means (DESFCM), and k-means soft constraints imputed-observed (KSC-IO). The data used were the 2018 environmental performance index (EPI) and simulated data. The optimal clustering on the 2018 EPI data was chosen based on the Silhouette index, whose capability had first been tested on the simulated dataset. The results showed that the Silhouette index has a good capability to validate clustering results on incomplete datasets, and that the optimal clustering on the 2018 EPI dataset was obtained by k-means using BI, where the Silhouette index and time complexity were 0.613 and 0.063 respectively. Based on these results, k-means using BI is suggested for clustering analysis of the 2018 EPI dataset.
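The Silhouette index used above to choose between clusterings can be computed directly from its definition. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the point's score is (b - a) / max(a, b). The index is the mean score. The sketch below assumes complete (for example, already imputed) numeric data and Euclidean distance, and it does not reproduce the paper's experiments. The function name is hypothetical.

```python
import math

def silhouette_index(points, labels):
    """Mean silhouette width over all points, using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters, and negative values indicate points that sit closer to another cluster than to their own. That is why the clustering with the highest index is the one selected.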


MODIFIED POSSIBILISTIC FUZZY C-MEANS ALGORITHM FOR CLUSTERING INCOMPLETE DATA SETS

Acta Polytechnica ◽

10.14311/ap.2021.61.0364 ◽

2021 ◽

Vol 61 (2) ◽

pp. 364-377

Author(s):

Rustam ◽

Koredianto Usman ◽

Mudyawati Kamaruddin ◽

Dina Chamidah ◽

Nopendri ◽

...

Keyword(s):

Experimental Data ◽

Incomplete Data ◽

Missing Values ◽

Complete Data ◽

Noise Sensitivity ◽

Data Sets ◽

Fuzzy C Means ◽

Number Of Iterations ◽

Fuzzy C Means Algorithm

The possibilistic fuzzy c-means (PFCM) algorithm is a reliable algorithm proposed to deal with the weaknesses of fuzzy c-means (FCM) and possibilistic c-means (PCM), namely noise sensitivity and coincident clusters. However, the PFCM algorithm is only applicable to complete data sets. Therefore, this research modified the PFCM for clustering incomplete data sets, yielding OCSPFCM and NPSPFCM, with the performance evaluated on three aspects: 1) accuracy percentage, 2) the number of iterations, and 3) centroid errors. The results showed that the NPSPFCM outperforms the OCSPFCM with missing values ranging from 5% to 30% for all experimental data sets. Furthermore, the two algorithms provide average accuracies of 97.75% to 78.98% and 98.86% to 92.49%, respectively.
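The abstract does not expand the acronyms. Assuming that the "NPS" in NPSPFCM stands for the nearest prototype strategy known from the fuzzy c-means literature on incomplete data, the core idea can be sketched as follows: each missing coordinate is replaced by the corresponding coordinate of the cluster prototype that is closest on the observed coordinates. This is a minimal sketch under that assumption, not the authors' algorithm. The function name and the NaN encoding of missing values are hypothetical.

```python
import math

def nearest_prototype_fill(x, prototypes):
    """Fill NaN entries of x from the prototype nearest on observed coordinates."""
    observed = [i for i, v in enumerate(x) if not math.isnan(v)]
    if not observed:
        raise ValueError("x has no observed coordinates")

    def partial_sq(proto):
        # Squared distance restricted to the coordinates observed in x.
        return sum((x[i] - proto[i]) ** 2 for i in observed)

    nearest = min(prototypes, key=partial_sq)
    return [nearest[i] if math.isnan(v) else v for i, v in enumerate(x)]

nan = float("nan")
centers = [[0.0, 5.0], [10.0, 50.0]]
print(nearest_prototype_fill([1.0, nan], centers))  # [1.0, 5.0]
```

In an iterative clustering algorithm such a fill would be repeated after every prototype update, so the substituted values move together with the cluster centers.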
