An Efficient K-Means Method Based on Centroid Handling for the Similarity Estimation

The main aim of this paper is to handle centroid calculation in k-means efficiently, so that distance estimation becomes more accurate and clustering yields more prominent results. The PIMA database has been considered for this purpose. Data preprocessing has been performed to remove unwanted data in the form of missing values. Centroid initialization has then been performed based on centroid tuning and randomization. For distance estimation, the Euclidean, Pearson coefficient, Chebyshev, and Canberra measures have been used. The evaluation has been performed through computational time analysis, with time calculated over different random sets. The method is found to perform prominently in all cases, considering variations in all aspects of distance and population.
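
As a point of reference for the four measures the abstract names, here is a minimal NumPy sketch of their standard definitions; the function names are mine, not the paper's, and the paper's own implementation may differ.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line (L2) distance between vectors a and b.
    return np.sqrt(np.sum((a - b) ** 2))

def pearson_distance(a, b):
    # 1 - Pearson correlation: 0 for perfectly correlated vectors,
    # 2 for perfectly anti-correlated ones.
    return 1.0 - np.corrcoef(a, b)[0, 1]

def chebyshev(a, b):
    # Largest coordinate-wise absolute difference (L-infinity).
    return np.max(np.abs(a - b))

def canberra(a, b):
    # Weighted L1 distance; terms with a zero denominator are skipped.
    den = np.abs(a) + np.abs(b)
    mask = den != 0
    return np.sum(np.abs(a - b)[mask] / den[mask])
```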

2013, Vol. 240, pp. 115-128
Author(s): Emil Eirola, Gauthier Doquire, Michel Verleysen, Amaury Lendasse

2020, Vol. 13 (2), pp. 65-75
Author(s): Ridho Ananda, Atika Ratna Dewi, Nurlaili Nurlaili

The existence of missing values seriously inhibits the clustering process. To overcome this, scientists have proposed several solutions, chiefly imputation and special clustering algorithms. This paper compares the clustering results obtained by both approaches on incomplete data. The k-means algorithm was applied to the imputed data. The algorithms used were distribution-free multiple imputation (DFMI), Gabriel eigen (GE), expectation maximization-singular value decomposition (EM-SVD), biplot imputation (BI), four modified fuzzy c-means (FCM) algorithms, k-means soft constraints (KSC), distance estimation strategy fuzzy c-means (DESFCM), and k-means soft constraints imputed-observed (KSC-IO). The data used were the 2018 environmental performance index (EPI) and simulated data. The optimal clustering on the 2018 EPI data was chosen based on the Silhouette index, whose capability had previously been tested on the simulated dataset. The results showed that the Silhouette index validates clustering results on incomplete datasets well, and that the optimal clustering of the 2018 EPI dataset was obtained by k-means with BI, with a Silhouette index of 0.613 and a time complexity of 0.063. Based on these results, k-means with BI is suggested for clustering analysis of the 2018 EPI dataset.
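
Since biplot imputation is not available in common libraries, the sketch below substitutes simple mean imputation to illustrate the impute-then-cluster workflow the abstract describes, with the Silhouette index used to pick the number of clusters; the data are synthetic stand-ins for the 2018 EPI table.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy incomplete data standing in for the 2018 EPI table.
rng = np.random.default_rng(0)
X = rng.normal(size=(180, 6))
X[rng.random(X.shape) < 0.1] = np.nan  # knock out ~10% of entries

# Impute first (mean imputation here; the paper compares richer
# schemes such as BI), then cluster the completed data.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

best_k, best_score = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_imputed)
    score = silhouette_score(X_imputed, labels)  # higher is better
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, silhouette = {best_score:.3f}")
```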


Author(s): Misha Kakkar, Sarika Jain, Abhay Bansal, P.S. Grover

Software Defect Prediction (SDP) models are used to predict whether software is clean or buggy using historical data collected from various software repositories. The data collected from such repositories may contain missing values. To estimate missing values, imputation techniques are used, which utilize the complete observed values in the dataset. The objective of this study is to identify the imputation technique best suited to handling missing values in SDP datasets. In addition, the authors investigate the most appropriate combination of imputation technique and data preprocessing method for building an SDP model. In this study, four combinations of imputation techniques and data preprocessing methods are examined using the improved NASA datasets. These combinations are used along with five different machine-learning algorithms to develop models. The performance of these SDP models is then compared using traditional performance indicators. Experimental results show that, among the imputation techniques considered, linear regression gives the most accurate imputed values, and that the combination of linear regression with a correlation-based feature selector outperforms all other combinations. To validate the significance of pairing data preprocessing methods with imputation, the findings are applied to open-source projects, where the results prove consistent with the above conclusion.
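
A hedged sketch of the winning combination the abstract reports: regression-based imputation followed by a correlation-based feature selector and a classifier. scikit-learn has no CFS implementation, so SelectKBest with an ANOVA F-score stands in for the correlation-based selector, and the data are synthetic stand-ins for a NASA defect table.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy defect data: 20 metrics, binary clean/buggy label, 5% missing.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300) > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan

pipe = make_pipeline(
    # Regression-based imputation: each column with missing entries
    # is modelled as a linear function of the other columns.
    IterativeImputer(estimator=LinearRegression(), random_state=0),
    # Stand-in for a correlation-based feature selector.
    SelectKBest(f_classif, k=8),
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())
```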


2013, Vol. 2013, pp. 1-7
Author(s): Lu Xu, Si-Min Yan, Chen-Bo Cai, Xiao-Ping Yu, Jian-Hui Jiang, ...

This paper aims at developing a rapid and nondestructive method for analyzing the shelf life of preserved eggs (pidan) by near-infrared (NIR) spectroscopy and nonlinear multivariate calibration. A major concern with a nonlinear model is that non-composition-correlated spectral variations among pidan objects of different batches and production dates would unnecessarily increase model complexity and cause overfitting and degraded prediction. To reduce the negative influence of unwanted spectral variations, stacked least squares support vector machine (LS-SVM) with an ensemble of 62 commonly used preprocessing methods is proposed to automatically optimize data preprocessing and develop the nonlinear model. The analysis results indicate that stacked LS-SVM obtains a stable calibration model and improves prediction accuracy compared with models built on single preprocessing methods. Since LS-SVM is much faster than its ordinary counterparts, stacked LS-SVM with ensemble preprocessing can be performed within an acceptable computational time. When the objects and spectral variations are very complex, the proposed method provides a useful tool for data preprocessing and nonlinear multivariate calibration.
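
The paper's 62-method preprocessing ensemble is not reproduced here; the sketch below only illustrates the general idea on three common NIR preprocessing choices, using kernel ridge regression (closely related to LS-SVM regression) and a plain average in place of the paper's stacking weights. All data and parameters are illustrative assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.pipeline import make_pipeline

# Toy NIR-like spectra: 100 samples x 200 wavelengths, with shelf life y.
rng = np.random.default_rng(2)
X = np.cumsum(rng.normal(size=(100, 200)), axis=1)
y = X[:, 50] + rng.normal(scale=0.1, size=100)

def snv(S):
    # Standard normal variate: centre and scale each spectrum (row).
    return (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)

preprocessors = {
    "raw": FunctionTransformer(),  # identity: no preprocessing
    "snv": FunctionTransformer(snv),
    "d1": FunctionTransformer(lambda S: savgol_filter(S, 11, 2, deriv=1, axis=1)),
}

# One kernel-ridge model per preprocessing method; predictions are
# averaged, a simplified stand-in for the paper's stacking scheme.
models = {
    name: make_pipeline(prep, StandardScaler(), KernelRidge(kernel="rbf", alpha=1.0))
    for name, prep in preprocessors.items()
}
for m in models.values():
    m.fit(X[:80], y[:80])
pred = np.mean([m.predict(X[80:]) for m in models.values()], axis=0)
print(np.sqrt(np.mean((pred - y[80:]) ** 2)))  # hold-out RMSE
```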


Author(s): Taichi Shiiba, Yoshihiro Suda

A driving simulator requires real-time calculation of vehicle dynamics in response to the driver's inputs, such as steering maneuvers and throttle and brake pedal operation. The authors have developed a driving simulator with a 91-DOF multibody vehicle model, which has been used as a 'virtual proving ground', that is, a virtual handling and ride test environment for automobiles. Multibody analysis results can be evaluated through body-sensory information, such as acceleration produced by a 6-axis motion base, and through visual information rendered by computer graphics. For real-time analysis with the multibody vehicle model, the authors developed an original MATLAB-based multibody analysis program. This paper details the environment for real-time multibody analysis in the driving simulator, its performance, and applications of the virtual proving ground.
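
The authors' MATLAB program is not public, so the following toy loop only illustrates the fixed-step, real-time pacing such a simulator needs: each frame reads the driver's inputs, advances the model one step, and sleeps off the rest of the frame. The point-mass model is a placeholder assumption, not the paper's 91-DOF multibody model.

```python
import time

def vehicle_step(state, inputs, dt):
    # Placeholder for one integration step of the multibody vehicle
    # model; here a toy longitudinal point-mass model in which
    # throttle accelerates and brake decelerates.
    x, v = state
    accel = 4.0 * inputs["throttle"] - 8.0 * inputs["brake"]
    return (x + v * dt, v + accel * dt)

dt = 0.001            # 1 kHz update rate, an assumed value
state = (0.0, 0.0)    # position, velocity
for step in range(1000):
    t0 = time.perf_counter()
    inputs = {"throttle": 0.3, "brake": 0.0}  # would come from the driver
    state = vehicle_step(state, inputs, dt)
    # Sleep off the remainder of the frame to hold real-time pacing
    # (time.sleep is only approximate at this resolution).
    leftover = dt - (time.perf_counter() - t0)
    if leftover > 0:
        time.sleep(leftover)
```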


2020, Vol. 17 (4A), pp. 635-644
Author(s): Ahmad Al-qerem, Ghazi Al-Naymat, Mays Alhasan, Mutaz Al-Debei

For financial institutions and the banking industry, it is crucial to have predictive models for core financial activities, especially those that play major roles in risk management. Predicting loan default is one of the critical issues that banks and financial institutions focus on, as huge revenue losses can be prevented by predicting a customer's ability not only to pay back, but to do so on time. Customer loan default prediction is the task of proactively identifying customers who are most likely to stop paying back their loans, usually by dynamically analyzing customers' relevant information and behaviors. This matters because it lets the bank or financial institution estimate the borrowers' risk. Many different machine-learning classification models and algorithms have been used to predict customers' ability to pay back loans. In this paper, three classification methods (Naïve Bayes, Decision Tree, and Random Forest) are used for prediction; several preprocessing techniques are applied to the dataset to obtain better data by fixing its main issues, such as missing values and imbalanced classes; and three different feature selection algorithms are used to enhance accuracy and performance. The results of the competing models varied after applying the data preprocessing techniques and feature selections, and were compared using the F1 measure. The best model achieved an improvement of about 40%, whilst the least performing model achieved an improvement of only 3%. This underlines the significance and importance of data engineering (e.g., data preprocessing and feature selection) in machine-learning exercises.
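
A minimal sketch of the comparison the abstract describes, assuming synthetic data in place of the real loan table: the three classifiers are wrapped in a pipeline with median imputation for missing values and class weighting for the imbalance, then ranked by cross-validated F1.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy imbalanced loan table: few defaults, 5% missing entries.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 2.0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan

models = {
    "naive bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(class_weight="balanced", random_state=0),
    "random forest": RandomForestClassifier(class_weight="balanced", random_state=0),
}
for name, clf in models.items():
    # Median imputation handles missing values before each classifier.
    pipe = make_pipeline(SimpleImputer(strategy="median"), clf)
    f1 = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()
    print(f"{name:14s} F1 = {f1:.3f}")
```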




2015
Author(s): Shakeel Ahmed Kamboh, Jane Labadin, Andrew Ragai Henri Rigit, Tech Chaw Ling, Khuda Bux Amur, ...

Author(s): Girdhar Gopal Ladha, Ravi Kumar Singh Pippal

This paper presents efficient distance estimation and centroid selection based on k-means clustering for small and large datasets. Data preprocessing was performed first on the dataset; the PIMA Indian diabetes dataset was considered for the complete study and analysis. After preprocessing, distance and centroid estimation was performed: initial centroids were selected by randomization, and centroid updates were then carried out until the specified number of iterations (epochs) was reached. The distance measures used are the Euclidean distance (Ed), Pearson coefficient distance (PCd), Chebyshev distance (Csd), and Canberra distance (Cad). The results indicate that all the distance measures performed approximately equally well for clustering, but in terms of time Cad outperforms the other measures.
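
To illustrate the kind of timing comparison the abstract reports, the sketch below times the assignment step of k-means under the four distance measures via SciPy's cdist (whose 'correlation' metric is 1 minus the Pearson coefficient, matching PCd). The data are random stand-ins for PIMA's 768 x 8 table, so absolute times are illustrative only.

```python
import time
import numpy as np
from scipy.spatial.distance import cdist

# Random stand-in for the PIMA table: 768 samples x 8 features.
rng = np.random.default_rng(4)
X = rng.normal(size=(768, 8))
k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init

for metric in ("euclidean", "correlation", "chebyshev", "canberra"):
    t0 = time.perf_counter()
    for _ in range(100):  # repeat the assignment step for stable timing
        labels = cdist(X, centroids, metric=metric).argmin(axis=1)
    print(f"{metric:12s} {time.perf_counter() - t0:.4f} s")
```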

