An Efficient K-Means Method Based on Centroid Handling for the Similarity Estimation

The main aim of this paper is to handle centroid calculation in k-means efficiently, so that distance estimation becomes more accurate and clustering yields more prominent results. The PIMA database has been considered for this purpose. Data preprocessing has been performed to remove unwanted data in the form of missing values. Centroid initialization has then been performed based on centroid tuning and randomization. For distance estimation, the Euclidean, Pearson coefficient, Chebyshev, and Canberra measures have been used. The evaluation has been performed through computational time analysis, with time calculated over different random sets. The method is found to perform prominently in all cases, considering variations in all aspects of distance and population.
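
As a point of reference for the four measures the abstract names, here is a minimal NumPy sketch of their standard definitions; the function names are mine, not the paper's, and the paper's own implementation may differ.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line (L2) distance between vectors a and b.
    return np.sqrt(np.sum((a - b) ** 2))

def pearson_distance(a, b):
    # 1 - Pearson correlation: 0 for perfectly correlated vectors,
    # 2 for perfectly anti-correlated ones.
    return 1.0 - np.corrcoef(a, b)[0, 1]

def chebyshev(a, b):
    # Largest coordinate-wise absolute difference (L-infinity).
    return np.max(np.abs(a - b))

def canberra(a, b):
    # Weighted L1 distance; terms with a zero denominator are skipped.
    den = np.abs(a) + np.abs(b)
    mask = den != 0
    return np.sum(np.abs(a - b)[mask] / den[mask])
```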

2013, Vol. 240, pp. 115-128
Author(s): Emil Eirola, Gauthier Doquire, Michel Verleysen, Amaury Lendasse

2020, Vol. 13 (2), pp. 65-75
Author(s): Ridho Ananda, Atika Ratna Dewi, Nurlaili Nurlaili

The existence of missing values seriously inhibits the clustering process. To overcome this, scientists have proposed several solutions, chiefly imputation and special clustering algorithms. This paper compares the clustering results obtained by both approaches on incomplete data. The k-means algorithm was applied to the imputed data. The algorithms used were distribution-free multiple imputation (DFMI), Gabriel eigen (GE), expectation maximization-singular value decomposition (EM-SVD), biplot imputation (BI), four modified fuzzy c-means (FCM) algorithms, k-means soft constraints (KSC), distance estimation strategy fuzzy c-means (DESFCM), and k-means soft constraints imputed-observed (KSC-IO). The data used were the 2018 environmental performance index (EPI) and simulated data. The optimal clustering on the 2018 EPI data was chosen based on the Silhouette index, whose capability had previously been tested on the simulated dataset. The results showed that the Silhouette index validates clustering results on incomplete datasets well, and that the optimal clustering of the 2018 EPI dataset was obtained by k-means with BI, with a Silhouette index of 0.613 and a time complexity of 0.063. Based on these results, k-means with BI is suggested for clustering analysis of the 2018 EPI dataset.
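
Since biplot imputation is not available in common libraries, the sketch below substitutes simple mean imputation to illustrate the impute-then-cluster workflow the abstract describes, with the Silhouette index used to pick the number of clusters; the data are synthetic stand-ins for the 2018 EPI table.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy incomplete data standing in for the 2018 EPI table.
rng = np.random.default_rng(0)
X = rng.normal(size=(180, 6))
X[rng.random(X.shape) < 0.1] = np.nan  # knock out ~10% of entries

# Impute first (mean imputation here; the paper compares richer
# schemes such as BI), then cluster the completed data.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

best_k, best_score = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_imputed)
    score = silhouette_score(X_imputed, labels)  # higher is better
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, silhouette = {best_score:.3f}")
```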


Author(s): Misha Kakkar, Sarika Jain, Abhay Bansal, P.S. Grover

Software Defect Prediction (SDP) models are used to predict whether software is clean or buggy using historical data collected from various software repositories. The data collected from such repositories may contain missing values. To estimate missing values, imputation techniques are used, which utilize the complete observed values in the dataset. The objective of this study is to identify the imputation technique best suited to handling missing values in SDP datasets. In addition, the authors investigate the most appropriate combination of imputation technique and data preprocessing method for building an SDP model. In this study, four combinations of imputation techniques and data preprocessing methods are examined using the improved NASA datasets. These combinations are used along with five different machine-learning algorithms to develop models. The performance of these SDP models is then compared using traditional performance indicators. Experimental results show that, among the imputation techniques considered, linear regression gives the most accurate imputed values, and that the combination of linear regression with a correlation-based feature selector outperforms all other combinations. To validate the significance of pairing data preprocessing methods with imputation, the findings are applied to open-source projects, where the results prove consistent with the above conclusion.
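
A hedged sketch of the winning combination the abstract reports: regression-based imputation followed by a correlation-based feature selector and a classifier. scikit-learn has no CFS implementation, so SelectKBest with an ANOVA F-score stands in for the correlation-based selector, and the data are synthetic stand-ins for a NASA defect table.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy defect data: 20 metrics, binary clean/buggy label, 5% missing.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300) > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan

pipe = make_pipeline(
    # Regression-based imputation: each column with missing entries
    # is modelled as a linear function of the other columns.
    IterativeImputer(estimator=LinearRegression(), random_state=0),
    # Stand-in for a correlation-based feature selector.
    SelectKBest(f_classif, k=8),
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())
```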


2013, Vol. 2013, pp. 1-7
Author(s): Lu Xu, Si-Min Yan, Chen-Bo Cai, Xiao-Ping Yu, Jian-Hui Jiang, ...

This paper aims at developing a rapid and nondestructive method for analyzing the shelf life of preserved eggs (pidan) by near-infrared (NIR) spectroscopy and nonlinear multivariate calibration. A major concern with a nonlinear model is that non-composition-correlated spectral variations among pidan objects of different batches and production dates would unnecessarily increase model complexity and cause overfitting and degraded prediction. To reduce the negative influence of unwanted spectral variations, stacked least squares support vector machine (LS-SVM) with an ensemble of 62 commonly used preprocessing methods is proposed to automatically optimize data preprocessing and develop the nonlinear model. The analysis results indicate that stacked LS-SVM obtains a stable calibration model and improves prediction accuracy compared with models built on single preprocessing methods. Since LS-SVM is much faster than its ordinary counterparts, stacked LS-SVM with ensemble preprocessing can be performed within an acceptable computational time. When the objects and spectral variations are very complex, the proposed method provides a useful tool for data preprocessing and nonlinear multivariate calibration.
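
The paper's 62-method preprocessing ensemble is not reproduced here; the sketch below only illustrates the general idea on three common NIR preprocessing choices, using kernel ridge regression (closely related to LS-SVM regression) and a plain average in place of the paper's stacking weights. All data and parameters are illustrative assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.pipeline import make_pipeline

# Toy NIR-like spectra: 100 samples x 200 wavelengths, with shelf life y.
rng = np.random.default_rng(2)
X = np.cumsum(rng.normal(size=(100, 200)), axis=1)
y = X[:, 50] + rng.normal(scale=0.1, size=100)

def snv(S):
    # Standard normal variate: centre and scale each spectrum (row).
    return (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)

preprocessors = {
    "raw": FunctionTransformer(),  # identity: no preprocessing
    "snv": FunctionTransformer(snv),
    "d1": FunctionTransformer(lambda S: savgol_filter(S, 11, 2, deriv=1, axis=1)),
}

# One kernel-ridge model per preprocessing method; predictions are
# averaged, a simplified stand-in for the paper's stacking scheme.
models = {
    name: make_pipeline(prep, StandardScaler(), KernelRidge(kernel="rbf", alpha=1.0))
    for name, prep in preprocessors.items()
}
for m in models.values():
    m.fit(X[:80], y[:80])
pred = np.mean([m.predict(X[80:]) for m in models.values()], axis=0)
print(np.sqrt(np.mean((pred - y[80:]) ** 2)))  # hold-out RMSE
```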


Author(s): Taichi Shiiba, Yoshihiro Suda

A driving simulator requires real-time calculation of vehicle dynamics in response to the driver's inputs, such as steering maneuvers and throttle and brake pedal operation. The authors have developed a driving simulator with a 91-DOF multibody vehicle model, which has been used as a 'virtual proving ground', that is, a virtual handling and ride test environment for automobiles. Multibody analysis results can be evaluated through body-sensory information, such as acceleration produced by a 6-axis motion base, and through visual information rendered by computer graphics. For real-time analysis with the multibody vehicle model, the authors developed an original MATLAB-based multibody analysis program. This paper details the environment for real-time multibody analysis in the driving simulator, its performance, and applications of the virtual proving ground.
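
The authors' MATLAB program is not public, so the following toy loop only illustrates the fixed-step, real-time pacing such a simulator needs: each frame reads the driver's inputs, advances the model one step, and sleeps off the rest of the frame. The point-mass model is a placeholder assumption, not the paper's 91-DOF multibody model.

```python
import time

def vehicle_step(state, inputs, dt):
    # Placeholder for one integration step of the multibody vehicle
    # model; here a toy longitudinal point-mass model in which
    # throttle accelerates and brake decelerates.
    x, v = state
    accel = 4.0 * inputs["throttle"] - 8.0 * inputs["brake"]
    return (x + v * dt, v + accel * dt)

dt = 0.001            # 1 kHz update rate, an assumed value
state = (0.0, 0.0)    # position, velocity
for step in range(1000):
    t0 = time.perf_counter()
    inputs = {"throttle": 0.3, "brake": 0.0}  # would come from the driver
    state = vehicle_step(state, inputs, dt)
    # Sleep off the remainder of the frame to hold real-time pacing
    # (time.sleep is only approximate at this resolution).
    leftover = dt - (time.perf_counter() - t0)
    if leftover > 0:
        time.sleep(leftover)
```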


2020, Vol. 17 (4A), pp. 635-644
Author(s): Ahmad Al-qerem, Ghazi Al-Naymat, Mays Alhasan, Mutaz Al-Debei

For financial institutions and the banking industry, it is crucial to have predictive models for core financial activities, especially those that play major roles in risk management. Predicting loan default is one of the critical issues that banks and financial institutions focus on, as huge revenue losses can be prevented by predicting a customer's ability not only to pay back, but to do so on time. Customer loan default prediction is the task of proactively identifying customers who are most likely to stop paying back their loans, usually by dynamically analyzing customers' relevant information and behaviors. This matters because it lets the bank or financial institution estimate the borrowers' risk. Many different machine-learning classification models and algorithms have been used to predict customers' ability to pay back loans. In this paper, three classification methods (Naïve Bayes, Decision Tree, and Random Forest) are used for prediction; several preprocessing techniques are applied to the dataset to obtain better data by fixing its main issues, such as missing values and imbalanced classes; and three different feature selection algorithms are used to enhance accuracy and performance. The results of the competing models varied after applying the data preprocessing techniques and feature selections, and were compared using the F1 measure. The best model achieved an improvement of about 40%, whilst the least performing model achieved an improvement of only 3%. This underlines the significance and importance of data engineering (e.g., data preprocessing and feature selection) in machine-learning exercises.
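
A minimal sketch of the comparison the abstract describes, assuming synthetic data in place of the real loan table: the three classifiers are wrapped in a pipeline with median imputation for missing values and class weighting for the imbalance, then ranked by cross-validated F1.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy imbalanced loan table: few defaults, 5% missing entries.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 2.0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan

models = {
    "naive bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(class_weight="balanced", random_state=0),
    "random forest": RandomForestClassifier(class_weight="balanced", random_state=0),
}
for name, clf in models.items():
    # Median imputation handles missing values before each classifier.
    pipe = make_pipeline(SimpleImputer(strategy="median"), clf)
    f1 = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()
    print(f"{name:14s} F1 = {f1:.3f}")
```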




2015
Author(s): Shakeel Ahmed Kamboh, Jane Labadin, Andrew Ragai Henri Rigit, Tech Chaw Ling, Khuda Bux Amur, ...

Author(s): Girdhar Gopal Ladha, Ravi Kumar Singh Pippal

This paper presents efficient distance estimation and centroid selection based on k-means clustering for small and large datasets. Data preprocessing was performed first on the dataset; the PIMA Indian diabetes dataset was considered for the complete study and analysis. After preprocessing, distance and centroid estimation was performed: initial centroids were selected by randomization, and centroid updates were then carried out until the specified number of iterations (epochs) was reached. The distance measures used are the Euclidean distance (Ed), Pearson coefficient distance (PCd), Chebyshev distance (Csd), and Canberra distance (Cad). The results indicate that all the distance measures performed approximately equally well for clustering, but in terms of time Cad outperforms the other measures.
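
To illustrate the kind of timing comparison the abstract reports, the sketch below times the assignment step of k-means under the four distance measures via SciPy's cdist (whose 'correlation' metric is 1 minus the Pearson coefficient, matching PCd). The data are random stand-ins for PIMA's 768 x 8 table, so absolute times are illustrative only.

```python
import time
import numpy as np
from scipy.spatial.distance import cdist

# Random stand-in for the PIMA table: 768 samples x 8 features.
rng = np.random.default_rng(4)
X = rng.normal(size=(768, 8))
k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init

for metric in ("euclidean", "correlation", "chebyshev", "canberra"):
    t0 = time.perf_counter()
    for _ in range(100):  # repeat the assignment step for stable timing
        labels = cdist(X, centroids, metric=metric).argmin(axis=1)
    print(f"{metric:12s} {time.perf_counter() - t0:.4f} s")
```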

