Accuracy Comparison of Euclidean Distance, Minkowski Distance, and Manhattan Distance in the K-Means Clustering Algorithm Based on Chi-Square

2019 ◽  
Vol 4 (1) ◽  
pp. 20-24 ◽  
Author(s):  
M Nishom
Author(s):  
Mahinda Mailagaha Kumbure ◽  
Pasi Luukka

Abstract: The fuzzy k-nearest neighbor (FKNN) algorithm, one of the most well-known and effective supervised learning techniques, has often been used in data classification problems but rarely in regression settings. This paper introduces a new, more general fuzzy k-nearest neighbor regression model. The generalization is based on using the Minkowski distance instead of the usual Euclidean distance. The Euclidean distance is often not the optimal choice for practical problems, and better results can be obtained by generalizing it. Using the Minkowski distance allows the proposed method to obtain more reasonable nearest neighbors to the target sample. Another key advantage of this method is that the nearest neighbors are weighted by fuzzy weights based on their similarity to the target sample, leading to the most accurate prediction through a weighted average. The performance of the proposed method is tested with eight real-world datasets from different fields and benchmarked against the k-nearest neighbor method and three other state-of-the-art regression methods. The Manhattan distance- and Euclidean distance-based FKNNreg methods are also implemented, and the results are compared. The empirical results show that the proposed Minkowski distance-based fuzzy regression (Md-FKNNreg) method outperforms the benchmarks and can be a good algorithm for regression problems. In particular, the Md-FKNNreg model achieved an overall average root mean square error (0.0769) significantly lower than that of all other regression methods used. As a special case of the Minkowski distance, the Manhattan distance yielded the optimal conditions for Md-FKNNreg and achieved the best performance for most of the datasets.
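The idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default parameters, and in particular the weight form 1 / d^(2/(m-1)), borrowed from the classical fuzzy k-NN membership formula, are assumptions made for illustration only.

```python
def minkowski(a, b, p=2.0):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def weighted_knn_predict(train_x, train_y, query, k=3, p=2.0, m=2.0, eps=1e-12):
    """Predict a target as a distance-weighted average of the k nearest neighbors.

    The weight 1 / d^(2/(m-1)) is an assumed choice modeled on the classical
    fuzzy k-NN membership form; the paper's exact weighting may differ.
    eps avoids division by zero when a neighbor coincides with the query.
    """
    # Pair each training target with its distance to the query, nearest first.
    dists = sorted((minkowski(x, query, p), y) for x, y in zip(train_x, train_y))
    nearest = dists[:k]
    weights = [1.0 / (d ** (2.0 / (m - 1.0)) + eps) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Toy data (made up): the target equals the single feature.
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 10.0]
prediction = weighted_knn_predict(train_x, train_y, (1.0,), k=3, p=1.0)
```

Changing p switches the neighbor search between Manhattan (p=1), Euclidean (p=2), and other Minkowski orders without touching the rest of the procedure, which is the kind of comparison the abstract describes.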


Author(s):  
Damar Riyadi ◽  
Aina Musdholifah

This study aims to improve the performance of Case-Based Reasoning (CBR) by using cluster analysis as an indexing method to speed up case retrieval. The clustering method is Local Triangular Kernel-based Clustering (LTKC). The cosine coefficient is used to find the relevant cluster, while the similarity value is calculated using the Manhattan, Euclidean, and Minkowski distances. The results of these methods are compared to determine which gives the best result. The study uses three test datasets: malnutrition disease, heart disease, and thyroid disease. Test results showed that CBR with LTKC-indexing has better accuracy and processing time than CBR without indexing. At a threshold of 0.9, the best accuracy for malnutrition disease was obtained using the Euclidean distance, with 100% accuracy and an average retrieval time of 0.0722 seconds. For heart disease, the best accuracy at the same threshold was obtained using the Minkowski distance, with 95% accuracy and an average retrieval time of 0.1785 seconds. For thyroid disease, the best accuracy at the same threshold was obtained using the Minkowski distance, with 92.52% accuracy and an average retrieval time of 0.3045 seconds. A comparison of CBR with SOM-indexing, DBSCAN-indexing, and LTKC-indexing for malnutrition disease and heart disease showed that the three have almost equal accuracy.


Techno Com ◽  
2021 ◽  
Vol 20 (2) ◽  
pp. 186-197
Author(s):  
Rahmatina Hidayati ◽  
Anis Zubair ◽  
Aditya Hidayat Pratama ◽  
Luthfi Indana

Clustering is the process of grouping a set of data into clusters whose members are similar. Similarity within a cluster is determined by a distance calculation. To examine the performance of several distance calculations, the authors of this study tested 6 datasets with different numbers of attributes, namely 2, 3, 4, and 6 attributes. From the comparison of distance formulas in K-Means clustering using the Silhouette coefficient, it can be concluded that: 1) Chebyshev distance has stable performance for data with both few and many attributes. 2) Average distance has the highest Silhouette coefficient compared with the other distance measures for data that contains outliers, such as dataset 3. 3) Mean Character Difference gives good results only for data with few attributes. 4) Euclidean distance, Manhattan distance, and Minkowski distance produce good values for data with few attributes, whereas for data with many attributes they obtain moderate values close to 0.5.
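As a rough illustration of the evaluation described here, the following Python sketch computes the mean Silhouette coefficient of a given cluster labeling under an interchangeable distance function. It covers only the scoring step, not K-Means itself; the function names and the toy points are assumptions, not taken from the study.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def silhouette(points, labels, dist):
    """Mean silhouette coefficient; requires at least two distinct labels."""
    clusters = {}
    for pt, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(pt)
    scores = []
    for pt, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (dist(pt, pt) is 0, so dividing by len(own) - 1 excludes the point).
        a = sum(dist(pt, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist(pt, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (made up): two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]
scores = {f.__name__: silhouette(points, good_labels, f)
          for f in (manhattan, euclidean, chebyshev)}
```

Values near 1 indicate compact, well-separated clusters, and values around 0.5 correspond to the "moderate" range mentioned in the abstract.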


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Shumpei Haginoya ◽  
Aiko Hanayama ◽  
Tamae Koike

Purpose: The purpose of this paper was to compare the accuracy of linking crimes using geographical proximity between three distance measures: Euclidean (distance measured by the length of a straight line between two locations), Manhattan (distance obtained by summing north-south distance and east-west distance) and the shortest route distances.
Design/methodology/approach: A total of 194 cases committed by 97 serial residential burglars in Aomori Prefecture in Japan between 2004 and 2015 were used in the present study. The Mann–Whitney U test was used to compare linked (two offenses committed by the same offender) and unlinked (two offenses committed by different offenders) pairs for each distance measure. Discrimination accuracy between linked and unlinked crime pairs was evaluated using area under the receiver operating characteristic curve (AUC).
Findings: The Mann–Whitney U test showed that the distances of the linked pairs were significantly shorter than those of the unlinked pairs for all distance measures. Comparison of the AUCs showed that the shortest route distance achieved significantly higher accuracy compared with the Euclidean distance, whereas there was no significant difference between the Euclidean and the Manhattan distance or between the Manhattan and the shortest route distance. These findings give partial support to the idea that distance measures taking the impact of environmental factors into consideration might be able to identify a crime series more accurately than Euclidean distances.
Research limitations/implications: Although the results suggested a difference between the Euclidean and the shortest route distance, it was small, and all distance measures resulted in outstanding AUC values, probably because of ceiling effects. Further investigation that makes the same comparison in a narrower area is needed to avoid this potential inflation of discrimination accuracy.
Practical implications: The shortest route distance might contribute to improving the accuracy of crime linkage based on geographical proximity. However, further investigation is needed before the shortest route distance can be recommended for use in practice. Given that the targeted area in the present study was relatively large, the findings may contribute especially to improving the accuracy of proactive comparative case analysis for estimating the whole picture of the distribution of serial crimes in the region by selecting a more effective distance measure.
Social implications: Implications for improving the accuracy of crime linkage may contribute to assisting crime investigations and the earlier arrest of offenders.
Originality/value: The results of the present study provide an initial indication of the efficacy of using distance measures that take environmental factors into account.
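The AUC used in this study can be read as the probability that a randomly chosen linked pair lies closer together than a randomly chosen unlinked pair, which is the Mann–Whitney U statistic divided by the number of pair comparisons. The Python sketch below shows that relationship on made-up distances; the function name and the numbers are hypothetical and are not data from the paper.

```python
def auc_shorter_is_linked(linked, unlinked):
    """AUC for the rule 'a shorter distance indicates a linked pair'.

    Counts, over all (linked, unlinked) comparisons, how often the linked
    distance is the shorter one; ties count as half. This equals the
    Mann-Whitney U statistic divided by len(linked) * len(unlinked).
    """
    wins = 0.0
    for d_linked in linked:
        for d_unlinked in unlinked:
            if d_linked < d_unlinked:
                wins += 1.0
            elif d_linked == d_unlinked:
                wins += 0.5
    return wins / (len(linked) * len(unlinked))


# Hypothetical inter-crime distances in kilometres (not from the study).
linked_km = [0.4, 1.1, 2.5]
unlinked_km = [1.8, 6.0, 9.3, 12.7]
auc = auc_shorter_is_linked(linked_km, unlinked_km)
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation. Running the same function on distances computed with each measure gives the per-measure AUCs that the study compares.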


Author(s):  
Qibin Zhou ◽  
Qinggang Su ◽  
Peng Xiong

Assisted download is an effective method for addressing the insufficient coverage range of Wi-Fi access in VANET. To address the low utilization of time-space resources within the blind area and the unbalanced download services in VANET, this paper proposes an approximate global optimum scheme, based on WebGIS, for selecting vehicles to assist downloads. The scheme uses two-dimensional matrices to define, respectively, the time-space resource and the vehicle-selecting behavior; uses a Markov Decision Process to solve the problem of time-space resource allocation within the blind area; and exploits the communication features of VANET to simplify the behavior space of vehicle selection and thereby reduce the computational complexity. The scheme also uses the Euclidean distance and the Manhattan distance as the basis for vehicle selection so that, when assisted download services are balanced, the target vehicles can effectively increase the total amount of user downloads. Experimental results show that, because of the wider access range and platform independence of WebGIS, the total amount of downloads increases by more than 20% when download services are relatively balanced. Moreover, WebGIS usually requires only a Web browser (sometimes with some plug-ins) on the client side, so the system cost is greatly reduced.


2021 ◽  
Vol 25 (01) ◽  
pp. 80-91
Author(s):  
Saba K. Naji ◽  
Muthana H. Hamd ◽  

The rapid development of electronic technology has reinforced the need to establish people's identities, and different methods and databases for identifying people have emerged. In this paper, we compare the results of two texture analysis methods: Local Binary Pattern (LBP) and Local Ternary Pattern (LTP). The comparison is based on the facial texture features extracted from 40 and 401 subjects taken from the ORL and UFI databases, respectively. The comparison also takes into account three distance measures: Manhattan Distance (MD), Euclidean Distance (ED), and Cosine Distance (CD). The maximum accuracy of the LBP method (99.23%) is obtained with the Manhattan distance on the ORL database, while the LTP method attains 98.76% using the same distance and database. The UFI facial database, by contrast, shows low quality, with recognition rates of 75.98% and 73.82% using LBP and LTP, respectively, with the Manhattan distance.
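To make the pipeline in this abstract concrete, here is a minimal Python sketch of the basic 8-neighbor LBP operator followed by a Manhattan comparison of the resulting histograms. The neighbor ordering, the ">=" thresholding rule, the whole-image histogram, and the function names are assumptions; the abstract does not say which LBP variant or block layout the authors used.

```python
def lbp_codes(img):
    """Basic 3x3 LBP codes for the interior pixels of a 2D list of intensities."""
    # Clockwise neighbor offsets starting at the top-left (an assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = []
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            center = img[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbor is at least as bright as the center.
                if img[r + dr][c + dc] >= center:
                    code |= 1 << bit
            codes.append(code)
    return codes


def lbp_histogram(img):
    """Normalized 256-bin histogram of LBP codes (all zeros if img is too small)."""
    hist = [0.0] * 256
    codes = lbp_codes(img)
    for code in codes:
        hist[code] += 1.0
    return [h / len(codes) for h in hist] if codes else hist


def manhattan(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))


# Toy 3x3 "images" (made up): a flat patch and a patch with a bright center.
flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
peak = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]
distance = manhattan(lbp_histogram(flat), lbp_histogram(peak))
```

In a recognition setting, a probe face would be assigned the identity of the gallery histogram with the smallest distance. Swapping `manhattan` for a Euclidean or cosine measure changes only that final comparison.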


2021 ◽  
Vol 17 (1) ◽  
pp. 74-91
Author(s):  
Neha Gupta ◽  
Sakshi Jolly

Data usually comes into data warehouses from multiple sources in different formats and is categorized into three groups: structured, semi-structured, and unstructured. Various data mining technologies are used to collect, refine, and analyze the data, which in turn leads to the problem of data quality management. Data purgation occurs when the data is subjected to ETL methodology in order to maintain and improve data quality. The data may contain unnecessary information and inappropriate symbols, which can be classified as dummy values, cryptic values, or missing values. The present work improves the expectation-maximization algorithm with the dot product to handle cryptic data, the DBSCAN method with the Gower metric to handle dummy values, Ward's algorithm with the Minkowski distance to improve the results for contradicting data, and the K-means algorithm with the Euclidean distance metric to handle missing values in a dataset. These distance metrics have improved the data quality and helped provide consistent data to be loaded into a data warehouse.


Author(s):  
Parag Jain

Most popular machine learning algorithms, such as k-nearest neighbour, k-means, and SVM, use a metric to measure the distance (or similarity) between data instances. The performance of these algorithms clearly depends heavily on the metric used. In the absence of prior knowledge about the data, we can only use general-purpose metrics such as Euclidean distance, cosine similarity, or Manhattan distance, but these metrics often fail to capture the correct behaviour of the data, which directly affects the performance of the learning algorithm. The solution to this problem is to tune the metric according to the data and the problem. Manually deriving a metric for high-dimensional data, which is often difficult even to visualize, is not only tedious but extremely difficult. This motivates metric learning, which fits the metric to the geometry of the data. The goal of a metric learning algorithm is to learn a metric that assigns small distances to similar points and relatively large distances to dissimilar points.

