Perbandingan Akurasi Euclidean Distance, Minkowski Distance, dan Manhattan Distance pada Algoritma K-Means Clustering berbasis Chi-Square

Abstract: The fuzzy k-nearest neighbor (FKNN) algorithm, one of the most well-known and effective supervised learning techniques, has often been used in data classification problems but rarely in regression settings. This paper introduces a new, more general fuzzy k-nearest neighbor regression model. The generalization is based on using the Minkowski distance instead of the usual Euclidean distance. The Euclidean distance is often not the optimal choice for practical problems, and better results can be obtained by generalizing it. Using the Minkowski distance allows the proposed method to obtain more reasonable nearest neighbors to the target sample. Another key advantage of this method is that the nearest neighbors are weighted by fuzzy weights based on their similarity to the target sample, and the prediction is obtained as a weighted average. The performance of the proposed method is tested on eight real-world datasets from different fields and benchmarked against the k-nearest neighbor method and three other state-of-the-art regression methods. The Manhattan distance- and Euclidean distance-based FKNNreg methods are also implemented, and the results are compared. The empirical results show that the proposed Minkowski distance-based fuzzy regression (Md-FKNNreg) method outperforms the benchmarks and can be a good algorithm for regression problems. In particular, the Md-FKNNreg model gave a significantly lower overall average root mean square error (0.0769) than all other regression methods used. As a special case of the Minkowski distance, the Manhattan distance yielded the optimal conditions for Md-FKNNreg and achieved the best performance for most of the datasets.
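
To make the relationship between the three distances concrete, the following minimal sketch shows the Minkowski distance of order p, which reduces to the Manhattan distance at p = 1 and the Euclidean distance at p = 2, together with a distance-weighted nearest-neighbor prediction. The function names and the inverse-distance weights are assumptions for illustration; the abstract does not give the paper's fuzzy weighting formula.

```python
def minkowski(a, b, p=2.0):
    """Minkowski distance of order p between two equal-length vectors.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    """
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def weighted_knn_predict(train_x, train_y, query, k=3, p=1.0, eps=1e-12):
    """Predict a target as a weighted average of the k nearest neighbours.

    Inverse-distance weights are a stand-in (assumption), not the
    fuzzy weights used in the paper.
    """
    # Sort (distance, target) pairs so the closest neighbours come first.
    dists = sorted((minkowski(x, query, p), y) for x, y in zip(train_x, train_y))
    nearest = dists[:k]
    # eps avoids division by zero when the query coincides with a sample.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
```

For example, `minkowski((0, 0), (3, 4), p=1)` is the Manhattan distance between the two points and `minkowski((0, 0), (3, 4), p=2)` is the Euclidean distance.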

Local Triangular Kernel-Based Clustering (LTKC) for Case Indexing on Case-Based Reasoning

IJCCS (Indonesian Journal of Computing and Cybernetics Systems) ◽

10.22146/ijccs.30423 ◽

2018 ◽

Vol 12 (2) ◽

pp. 139

Author(s):

Damar Riyadi ◽

Aina Musdholifah

Keyword(s):

Heart Disease ◽

Thyroid Disease ◽

Euclidean Distance ◽

Manhattan Distance ◽

Case Based Reasoning ◽

Minkowski Distance ◽

Retrieval Time ◽

Accuracy Comparison ◽

Speed Up ◽

Case Based

This study aims to improve the performance of Case-Based Reasoning (CBR) by using cluster analysis as an indexing method to speed up case retrieval. The clustering method is Local Triangular Kernel-based Clustering (LTKC). The cosine coefficient is used to find the relevant cluster, while the similarity value is calculated using the Manhattan, Euclidean, and Minkowski distances. The results of these methods are compared to find which gives the best result. The study uses three test datasets: malnutrition disease, heart disease, and thyroid disease. Test results showed that CBR with LTKC indexing has better accuracy and processing time than CBR without indexing. At threshold 0.9, the best accuracy for malnutrition disease was obtained with the Euclidean distance, which produced 100% accuracy and an average retrieval time of 0.0722 seconds. The best accuracy for heart disease was obtained with the Minkowski distance, which produced 95% accuracy and an average retrieval time of 0.1785 seconds. The best accuracy for thyroid disease was obtained with the Minkowski distance, which produced 92.52% accuracy and an average retrieval time of 0.3045 seconds. A comparison of CBR with SOM indexing, DBSCAN indexing, and LTKC indexing for malnutrition disease and heart disease showed almost equal accuracy.
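
The two-stage retrieval described above (pick the relevant cluster by cosine coefficient, then score cases inside it by distance) can be sketched as follows. This is a hedged illustration only: the data layout, the `retrieve` function, and the conversion of distance to similarity as 1 / (1 + d) are assumptions, since the abstract does not specify them. Only the 0.9 threshold comes from the abstract.

```python
import math


def cosine(a, b):
    """Cosine coefficient between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def retrieve(query, clusters, p=2.0, threshold=0.9):
    """Return (label, similarity); label is None below the threshold.

    clusters: list of {"centroid": vector, "cases": [(vector, label), ...]}
    (a hypothetical layout chosen for this sketch).
    """
    # Stage 1: indexing step, search only the cluster most similar to the query.
    best_cluster = max(clusters, key=lambda c: cosine(query, c["centroid"]))
    # Stage 2: similarity of each case in that cluster.
    # Assumption: similarity = 1 / (1 + distance).
    best_sim, best_label = max(
        (1.0 / (1.0 + minkowski(query, vec, p)), label)
        for vec, label in best_cluster["cases"]
    )
    return (best_label if best_sim >= threshold else None), best_sim
```

Restricting the search to one cluster is what reduces retrieval time: only the cases in the selected cluster are compared against the query.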

Analisis Silhouette Coefficient pada 6 Perhitungan Jarak K-Means Clustering

Techno Com ◽

10.33633/tc.v20i2.4556 ◽

2021 ◽

Vol 20 (2) ◽

pp. 186-197

Author(s):

Rahmatina Hidayati ◽

Anis Zubair ◽

Aditya Hidayat Pratama ◽

Luthfi Indana

Keyword(s):

Euclidean Distance ◽

Average Distance ◽

Manhattan Distance ◽

Minkowski Distance ◽

Silhouette Coefficient

Clustering is the process of grouping a set of data into clusters whose members are similar. Similarity within a cluster is determined by a distance calculation. To examine the performance of several distance calculations, in this study the authors test them on 6 datasets with different numbers of attributes, namely 2, 3, 4, and 6 attributes. From the comparison of distance formulas in K-Means clustering using the Silhouette coefficient, it can be concluded that: 1) Chebyshev distance performs stably for data with both few and many attributes. 2) Average distance has the highest Silhouette coefficient compared with the other distance measures for data with outliers, such as dataset 3. 3) Mean Character Difference gives good results only for data with few attributes. 4) Euclidean distance, Manhattan distance, and Minkowski distance give good values for data with few attributes, whereas for data with many attributes they give moderate values close to 0.5.
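
As a rough illustration of how a Silhouette coefficient can be evaluated under different distance formulas, the sketch below computes the mean silhouette for a given labelling with a pluggable distance function. It covers only three of the distances named in the abstract (Euclidean, Manhattan, Chebyshev); the function names and the handling of degenerate cases are assumptions, not the authors' implementation.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def silhouette(points, labels, dist):
    """Mean silhouette coefficient of a labelling under the distance `dist`."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Collect distances from point i to every other point, per cluster.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(dist(p, q))
        own = by_cluster.pop(lab, [])
        if not own or not by_cluster:
            # Singleton cluster scores 0 by convention; the single-cluster
            # case is also mapped to 0 here as a simplification.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.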

Linkage analysis using geographical proximity: a test of the efficacy of distance measures

Journal of Criminological Research Policy and Practice ◽

10.1108/jcrpp-01-2020-0006 ◽

2020 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Shumpei Haginoya ◽

Aiko Hanayama ◽

Tamae Koike

Keyword(s):

Environmental Factors ◽

Euclidean Distance ◽

Distance Measure ◽

Distance Measures ◽

Manhattan Distance ◽

Geographical Proximity ◽

Discrimination Accuracy ◽

Content Type ◽

Shortest Route ◽

The Impact

Purpose: The purpose of this paper was to compare the accuracy of linking crimes using geographical proximity between three distance measures: Euclidean (distance measured by the length of a straight line between two locations), Manhattan (distance obtained by summing north-south distance and east-west distance) and the shortest route distances. Design/methodology/approach: A total of 194 cases committed by 97 serial residential burglars in Aomori Prefecture in Japan between 2004 and 2015 were used in the present study. The Mann–Whitney U test was used to compare linked (two offenses committed by the same offender) and unlinked (two offenses committed by different offenders) pairs for each distance measure. Discrimination accuracy between linked and unlinked crime pairs was evaluated using the area under the receiver operating characteristic curve (AUC). Findings: The Mann–Whitney U test showed that the distances of the linked pairs were significantly shorter than those of the unlinked pairs for all distance measures. Comparison of the AUCs showed that the shortest route distance achieved significantly higher accuracy than the Euclidean distance, whereas there was no significant difference between the Euclidean and the Manhattan distance or between the Manhattan and the shortest route distance. These findings give partial support to the idea that distance measures taking the impact of environmental factors into consideration might be able to identify a crime series more accurately than Euclidean distances. Research limitations/implications: Although the results suggested a difference between the Euclidean and the shortest route distance, it was small, and all distance measures resulted in outstanding AUC values, probably because of ceiling effects. Further investigation that makes the same comparison in a narrower area is needed to avoid this potential inflation of discrimination accuracy.
Practical implications: The shortest route distance might contribute to improving the accuracy of crime linkage based on geographical proximity. However, further investigation is needed before the shortest route distance can be recommended in practice. Given that the targeted area in the present study was relatively large, the findings may contribute especially to improving the accuracy of proactive comparative case analysis for estimating the whole picture of the distribution of serial crimes in the region by selecting a more effective distance measure. Social implications: Improving the accuracy of linking crimes may contribute to assisting crime investigations and the earlier arrest of offenders. Originality/value: The results of the present study provide an initial indication of the efficacy of using distance measures that take environmental factors into account.
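
The evaluation described in this abstract relies on a standard identity: the AUC equals the Mann–Whitney U statistic divided by the number of linked-unlinked comparisons, that is, the probability that a randomly chosen linked pair is closer together than a randomly chosen unlinked pair. The sketch below illustrates that identity and the two planar distance measures. The function names are hypothetical, coordinates are assumed to be already projected to a planar grid, and the shortest route distance is omitted because it requires road-network data.

```python
def euclidean(a, b):
    """Straight-line distance between two planar (x, y) locations."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def manhattan(a, b):
    """East-west distance plus north-south distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def auc_shorter_is_linked(linked, unlinked):
    """AUC for the rule 'shorter distance means linked'.

    Counts, over all (linked, unlinked) comparisons, how often the linked
    distance is shorter; ties count as half. This equals U / (n1 * n2).
    """
    wins = 0.0
    for d_linked in linked:
        for d_unlinked in unlinked:
            if d_linked < d_unlinked:
                wins += 1.0
            elif d_linked == d_unlinked:
                wins += 0.5
    return wins / (len(linked) * len(unlinked))
```

An AUC of 0.5 means the distance measure discriminates no better than chance, and 1.0 means every linked pair is closer than every unlinked pair.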

A Scheme of Selecting Vehicles to Assist Download Based on WebGIS for VANET

Journal of Web Engineering ◽

10.13052/jwe1540-9589.2129 ◽

2022 ◽

Author(s):

Qibin Zhou ◽

Qinggang Su ◽

Peng Xiong

Keyword(s):

Euclidean Distance ◽

Global Optimum ◽

Manhattan Distance ◽

Web Browser ◽

Problem Of Time ◽

Time Space ◽

Blind Area ◽

Markov Decision ◽

Platform Independence ◽

Client Side

Assisted download is an effective way to address the insufficient coverage range of Wi-Fi access in VANETs. To address the low utilization of time-space resources within blind areas and unbalanced download services in VANETs, this paper proposes an approximately globally optimal scheme, based on WebGIS, for selecting vehicles to assist downloads. The scheme uses a two-dimensional matrix to define, respectively, the time-space resource and the vehicle-selection behavior, uses a Markov Decision Process to solve the problem of time-space resource allocation within the blind area, and exploits the communication features of VANETs to simplify the behavior space of vehicle selection and so reduce computational complexity. Euclidean distance and Manhattan distance are used as the basis for vehicle selection so that, when assisted download services are balanced, the selected vehicles effectively increase the total amount of user downloads. Experimental results show that, owing to the wider access range and platform independence of WebGIS, the total amount of downloads increases by more than 20% when download services are relatively balanced. Moreover, WebGIS usually needs only a Web browser (sometimes with plug-ins) on the client side, so the system cost is greatly reduced.

HUMAN IDENTIFICATION BASED ON FACE RECOGNITION SYSTEM

Journal of Engineering and Sustainable Development ◽

10.31272/jeasd.25.1.7 ◽

2021 ◽

Vol 25 (01) ◽

pp. 80-91

Author(s):

Saba K. Naji ◽

Muthana H. Hamd

Keyword(s):

Face Recognition ◽

Euclidean Distance ◽

Local Binary Pattern ◽

Texture Features ◽

Recognition System ◽

Manhattan Distance ◽

Distance Measurements ◽

Local Ternary Pattern ◽

Face Recognition System ◽

Cosine Distance

Owing to rapid developments in electronics, which have reinforced the need to establish people's identities, different methods and databases for identifying people have emerged. In this paper, we compare the results of two texture analysis methods: Local Binary Pattern (LBP) and Local Ternary Pattern (LTP). The comparison is based on facial texture features extracted from 40 and 401 subjects taken from the ORL and UFI databases, respectively. The comparison also takes into account three distance measurements: Manhattan Distance (MD), Euclidean Distance (ED), and Cosine Distance (CD). The maximum accuracy of the LBP method (99.23%) is obtained with the Manhattan distance on the ORL database, while the LTP method attained 98.76% using the same distance and database. The UFI facial database, which is of lower quality, yields recognition rates of 75.98% and 73.82% using LBP and LTP, respectively, with the Manhattan distance.
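
For readers unfamiliar with the LBP descriptor, the following minimal sketch shows the basic 3×3 variant: each pixel's eight neighbours are thresholded against the centre to form an 8-bit code, the codes are collected into a histogram, and two histograms are compared with the Manhattan distance. The neighbour ordering, function names, and whole-image histogram are assumptions for illustration; the abstract does not state which LBP configuration the authors used.

```python
def lbp_code(img, r, c):
    """8-bit LBP code of the pixel at (r, c) in a 2D list of grey levels."""
    center = img[r][c]
    # Neighbours taken clockwise from the top-left corner (assumed ordering).
    neighbours = [
        img[r - 1][c - 1], img[r - 1][c], img[r - 1][c + 1],
        img[r][c + 1],
        img[r + 1][c + 1], img[r + 1][c], img[r + 1][c - 1],
        img[r][c - 1],
    ]
    code = 0
    for bit, value in enumerate(neighbours):
        if value >= center:  # threshold each neighbour against the centre
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


def manhattan(h1, h2):
    """Manhattan distance between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))
```

A probe face would then be assigned the identity of the gallery image whose histogram has the smallest distance to the probe's histogram.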

Enhancing Data Quality at ETL Stage of Data Warehousing

International Journal of Data Warehousing and Mining ◽

10.4018/ijdwm.2021010105 ◽

2021 ◽

Vol 17 (1) ◽

pp. 74-91

Author(s):

Neha Gupta ◽

Sakshi Jolly

Keyword(s):

Data Quality ◽

Data Warehouse ◽

Expectation Maximization ◽

Euclidean Distance ◽

Missing Values ◽

Expectation Maximization Algorithm ◽

Distance Metrics ◽

Multiple Sources ◽

Minkowski Distance ◽

Data Quality Management

Data usually comes into data warehouses from multiple sources in different formats and is categorized into three groups (i.e., structured, semi-structured, and unstructured). Various data mining technologies are used to collect, refine, and analyze the data, which in turn raises the problem of data quality management. Data purgation occurs when the data is subjected to the ETL methodology in order to maintain and improve data quality. The data may contain unnecessary information and inappropriate symbols, which can be classified as dummy values, cryptic values, or missing values. The present work improves the expectation-maximization algorithm with the dot product to handle cryptic data, the DBSCAN method with the Gower metric to handle dummy values, Ward's algorithm with the Minkowski distance to improve the results for contradicting data, and the K-means algorithm with the Euclidean distance metric to handle missing values in a dataset. These distance metrics improved the data quality and helped provide consistent data to be loaded into a data warehouse.

Comparative Analysis of Inter-Centroid K-Means Performance using Euclidean Distance, Canberra Distance and Manhattan Distance

Journal of Physics Conference Series ◽

10.1088/1742-6596/1566/1/012112 ◽

2020 ◽

Vol 1566 ◽

pp. 012112 ◽

Cited By ~ 1

Author(s):

M Faisal ◽

E M Zamzami ◽

Sutarman

Keyword(s):

Comparative Analysis ◽

Euclidean Distance ◽

Manhattan Distance ◽

Canberra Distance

Analysis of euclidean distance and manhattan distance measure in face recognition

Third International Conference on Computational Intelligence and Information Technology (CIIT 2013) ◽

10.1049/cp.2013.2636 ◽

2013 ◽

Cited By ~ 7

Author(s):

M.D. Malkauthekar

Keyword(s):

Face Recognition ◽

Euclidean Distance ◽

Distance Measure ◽

Manhattan Distance

Metric Learning Tutorial

10.20944/preprints201809.0131.v1 ◽

2018 ◽

Author(s):

Parag Jain

Keyword(s):

Machine Learning ◽

Euclidean Distance ◽

Learning Algorithm ◽

Metric Learning ◽

General Purpose ◽

Small Distance ◽

Machine Learning Algorithms ◽

High Dimensional ◽

Manhattan Distance ◽

Nearest Neighbour

Most popular machine learning algorithms, such as k-nearest neighbour, k-means, and SVM, use a metric to measure the distance (or similarity) between data instances. The performance of these algorithms clearly depends heavily on the metric being used. In the absence of prior knowledge about the data, we can only use general-purpose metrics such as Euclidean distance, cosine similarity, or Manhattan distance, but these metrics often fail to capture the correct behaviour of the data, which directly affects the performance of the learning algorithm. The solution to this problem is to tune the metric according to the data and the problem. Manually deriving the metric for high-dimensional data, which is often difficult even to visualize, is not only tedious but extremely difficult. This motivates metric learning, which adapts the metric to the data geometry. The goal of a metric learning algorithm is to learn a metric that assigns small distances to similar points and relatively large distances to dissimilar points.
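
One common way to parametrize a learnable metric, which the abstract itself does not name, is a Mahalanobis-style distance d_M(x, y) = sqrt((x − y)ᵀ M (x − y)) with a positive semidefinite matrix M. The sketch below only evaluates such a distance for a given M, to show how the choice of M changes which points count as close; it does not learn M, and the function name is an assumption.

```python
import math


def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M.

    M is a list of rows. With M equal to the identity matrix this is the
    Euclidean distance; other choices stretch or ignore feature directions.
    """
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    # Matrix-vector product M @ diff.
    m_diff = [sum(M[i][j] * diff[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(d * m for d, m in zip(diff, m_diff)))
```

For instance, with `M = [[1, 0], [0, 0]]` the second feature is ignored entirely, so two points that differ only in that feature are at distance zero. A metric learning algorithm would choose M from labelled examples of similar and dissimilar pairs.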
