Determining Cosine Similarity Neighborhoods by Means of the Euclidean Distance

Author(s):  
Marzena Kryszkiewicz
2019 ◽  
Vol 35 (13) ◽  
pp. 1400-1414 ◽  
Author(s):  
Miriam Rodrigues da Silva ◽  
Osmar Abílio de Carvalho ◽  
Renato Fontes Guimarães ◽  
Roberto Arnaldo Trancoso Gomes ◽  
Cristiano Rosa Silva

2018 ◽  
Vol 7 (4.44) ◽  
pp. 156
Author(s):  
Faisal Rahutomo ◽  
Trisna Ari Roshinta ◽  
Erfan Rohadi ◽  
Indrazno Siradjuddin ◽  
Rudy Ariyanto ◽  
...  

This paper presents open problems in Indonesian essay scoring systems. A previous study compared several similarity metrics for automated essay scoring in Indonesian: Cosine Similarity, Euclidean Distance, and Jaccard. The dataset comprises about 2,000 texts, obtained from 50 students who answered 40 questions on politics, sports, lifestyle, and technology. The study also evaluates the effect of stemming on system performance. Across all methods, the difference between using stemming and not using it is around 4-9%. The results show that Jaccard is the best metric both with and without stemming. The politics category has a higher average similarity score than the lifestyle, sports, and technology categories. With stemming, the percentage error is 52.31% for Jaccard, 59.49% for Cosine Similarity, and 332.90% for Euclidean Distance. Without stemming, it is 56.05% for Jaccard, 57.99% for Cosine Similarity, and 339.41% for Euclidean Distance. However, these percentage errors, all above 50%, are too high for a functional essay grading system. This paper therefore explores several open problems in this area. The openly available dataset can be used to develop better approaches than the standard similarity metrics. The approaches explored range from feature extraction, similarity metrics, and learning algorithms to implementation environment and performance evaluation.
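To make the three metrics compared in this abstract concrete, the following is a minimal Python sketch of how each could be computed for a pair of answers. It is not the study's pipeline: whitespace tokenization, raw term counts, and the function names are assumptions made here for illustration, and no stemming is applied.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two term-count vectors."""
    count_a, count_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count_a[t] * count_b[t] for t in count_a)
    norm_a = math.sqrt(sum(c * c for c in count_a.values()))
    norm_b = math.sqrt(sum(c * c for c in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity with an empty text is defined as 0 here
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the two term-count vectors."""
    count_a, count_b = Counter(tokenize(a)), Counter(tokenize(b))
    vocabulary = set(count_a) | set(count_b)
    return math.sqrt(sum((count_a[t] - count_b[t]) ** 2 for t in vocabulary))
```

Note that Jaccard and cosine are bounded similarities, while the Euclidean distance is unbounded and grows with text length. Any mapping from distance to a score is therefore a further design choice that this sketch leaves open.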


Author(s):  
Dengdi Sun ◽  
Chris Ding ◽  
Jin Tang ◽  
Bin Luo

Dimensionality reduction plays a vital role in pattern recognition. However, for normalized vector data, existing methods do not exploit the fact that the data are normalized. In this chapter, the authors propose an Angular Decomposition of the normalized vector data, which corresponds to embedding them on a unit surface. For graph data with similarity/kernel matrices that have constant diagonal elements, the authors propose an Angular Decomposition of the similarity matrices, which corresponds to embedding objects on a unit sphere. In these angular embeddings, the Euclidean distance is equivalent to the cosine similarity. Thus, data structures best described by the cosine similarity and data structures best captured by the Euclidean distance can both be detected effectively in the angular embedding. The authors provide the theoretical analysis, derive the computational algorithm, and evaluate the angular embedding on several datasets. Experiments on data clustering demonstrate that the method can provide a more discriminative subspace.
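The equivalence mentioned in this abstract rests on a standard identity: for unit-length vectors u and v, ||u - v||^2 = 2 - 2 cos(u, v). Ranking neighbors by increasing Euclidean distance on normalized data therefore gives the same order as ranking by decreasing cosine similarity. The Python sketch below checks the identity numerically. It illustrates only this identity, not the Angular Decomposition itself, and the helper names are chosen here for illustration.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_from_unit_distance(d):
    """Recover the cosine from the distance d between two unit vectors."""
    return 1.0 - d * d / 2.0


u, v = [3.0, 4.0], [4.0, 3.0]
d = euclidean_distance(normalize(u), normalize(v))
# Both values agree up to floating-point rounding.
print(cosine_similarity(u, v), cosine_from_unit_distance(d))
```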


2021 ◽  
Vol 58 (04) ◽  
pp. 1395-1403
Author(s):  
Ayesha Hakim

Acoustic recordings of birds have been used by conservationists and ecologists to determine the population density and biodiversity of bird species in a region. However, it is hard to analyze and visualize the presence or absence of a specific bird species by listening to these recordings, even for an expert bird song specialist. In this paper, we present a computational tool to cluster and recognize bird species based on their sounds and to visualize relationships between within-species and between-species sounds based on their similarity measures. The tool has been evaluated on two datasets of varying complexity containing acoustic recordings of eleven birds’ songs and calls using various similarity measures. Principal Component Analysis (PCA) was used for feature selection. Euclidean distance, Mahalanobis distance, and cosine similarity among features were used for pair-wise similarity calculation. The results of the similarity measures have been compared using 3-fold cross-validation and validated against spectrogram patterns obtained from the frequency representation of the acoustic recordings of the selected birds’ songs and calls. Cosine similarity performed better at measuring underlying patterns in birds’ sounds and identifying mutual relationships among species. It was concluded that the proposed tool can be used as a novel method by conservationists, ecologists, ornithologists, and evolutionary scientists, as well as tourists and bird watchers, to recognize different bird species, study their mutual relationships, locate the area with the highest population density, and estimate predators and biodiversity in a specific region.


2020 ◽  
Author(s):  
Cameron Hargreaves ◽  
Matthew Dyer ◽  
Michael Gaultois ◽  
Vitaliy Kurlin ◽  
Matthew J Rosseinsky

It is a core problem in any field to reliably tell how close two objects are to being the same, and once this relation has been established we can use this information to precisely quantify potential relationships, both analytically and with machine learning (ML). For inorganic solids, the chemical composition is a fundamental descriptor, which can be represented by assigning the ratio of each element in the material to a vector. These vectors are a convenient mathematical data structure for measuring similarity, but unfortunately, the standard metric (the Euclidean distance) gives little to no variance in the resultant distances between chemically dissimilar compositions. We present the Earth Mover’s Distance (EMD) for inorganic compositions, a well-defined metric which enables the measure of chemical similarity in an explainable fashion. We compute the EMD between two compositions from the ratio of each of the elements and the absolute distance between the elements on the modified Pettifor scale. This simple metric shows clear strength at distinguishing compounds and is efficient to compute in practice. The resultant distances have greater alignment with chemical understanding than the Euclidean distance, which is demonstrated on the binary compositions of the Inorganic Crystal Structure Database (ICSD). The EMD is a reliable numeric measure of chemical similarity that can be incorporated into automated workflows for a range of ML techniques. We have found that with no supervision the use of this metric gives a distinct partitioning of binary compounds into clear trends and families of chemical property, with future applications for nearest neighbor search queries in chemical database retrieval systems and supervised ML techniques.
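As a rough illustration of the idea in this abstract, the sketch below computes a one-dimensional Earth Mover's Distance between two compositions whose elements are placed on a scalar scale. It is a minimal sketch under stated assumptions, not the authors' implementation: the element labels and positions in the usage note are hypothetical placeholders rather than modified Pettifor numbers, and the function name is chosen here. In one dimension with equal total mass, the EMD equals the accumulated absolute difference between the two cumulative distributions.

```python
import math


def composition_emd(p, q, position):
    """1-D Earth Mover's Distance between two compositions.

    p, q     -- dicts mapping element label -> fraction, each summing to 1
    position -- dict mapping element label -> scalar position on the scale
    """
    for name, comp in (("p", p), ("q", q)):
        if not math.isclose(sum(comp.values()), 1.0, abs_tol=1e-9):
            raise ValueError(f"fractions in {name} must sum to 1")

    # Walk the elements in scale order, carrying the surplus mass forward.
    elements = sorted(set(p) | set(q), key=lambda e: position[e])
    total = 0.0
    carried = 0.0
    for prev, cur in zip(elements, elements[1:]):
        carried += p.get(prev, 0.0) - q.get(prev, 0.0)
        # Moving the surplus across this gap costs |surplus| * gap width.
        total += abs(carried) * (position[cur] - position[prev])
    return total
```

With hypothetical positions `{"A": 0, "B": 1, "C": 3}`, the pure compositions `{"A": 1.0}` and `{"C": 1.0}` are at distance 3, while `{"A": 1.0}` and `{"B": 1.0}` are at distance 1. The Euclidean distance between the corresponding composition vectors is the same for both pairs, which is the lack of variance the abstract describes.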


Author(s):  
Luis Fernando Segalla ◽  
Alexandre Zabot ◽  
Diogo Nardelli Siebert ◽  
Fabiano Wolf

Author(s):  
Tu Huynh-Kha ◽  
Thuong Le-Tien ◽  
Synh Ha ◽  
Khoa Huynh-Van

This research work develops a new method to detect image forgery by combining the Wavelet transform with modified Zernike Moments (MZMs), in which the features are defined from more pixels than in traditional Zernike Moments. The tested image is first converted to grayscale, and a one-level Discrete Wavelet Transform (DWT) is applied to halve the size of the image along each dimension. The approximation sub-band (LL), which is used for processing, is then divided into overlapping blocks, and modified Zernike moments are calculated in each block as feature vectors. The more pixels are considered, the more sufficient the extracted features. Lexicographical sorting and computation of correlation coefficients on the feature vectors are the next steps in finding similar blocks. The purpose of applying DWT to reduce the dimension of the image before using Zernike moments with updated coefficients is to reduce computation time and increase detection accuracy. Copied or duplicated parts are detected as traces of copy-move forgery based on a threshold on the correlation coefficients and then confirmed by a Euclidean distance constraint. Comparison results between the proposed method and related ones demonstrate the feasibility and efficiency of the proposed algorithm.

