The Exploitation of Distance Distributions for Clustering

Although distance measures are used in many machine learning algorithms, the literature on the context-independent selection and evaluation of distance measures is limited, in that existing approaches rely on prior knowledge. In cluster analysis, current studies evaluate the choice of distance measure after applying unsupervised methods, based on error probabilities, and thereby implicitly set the goal of reproducing predefined partitions in the data. Such studies use clusters that are often based on the context of the data as well as on the custom goal of the specific study. Depending on the data context, different properties of distance distributions are judged to be relevant for appropriate distance selection. However, if cluster analysis is based on the task of finding similar partitions of data, then the intrapartition distances should be smaller than the interpartition distances. By systematically investigating this specification using distribution analysis through the mirrored-density plot (MD plot), it is shown that multimodal distance distributions are preferable in cluster analysis. As a consequence, it is advantageous to model distance distributions with Gaussian mixtures prior to the evaluation phase of unsupervised methods. Experiments are performed on several artificial and natural datasets for the task of clustering.
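The requirement that intrapartition distances be smaller than interpartition distances can be illustrated with a small sketch. The toy points, the labels, and the helper name `split_distances` are assumptions made for illustration and do not come from the study; the sketch reproduces neither the MD plot nor the Gaussian mixture modeling.

```python
from itertools import combinations
from math import dist


def split_distances(points, labels):
    """Return (intra, inter) lists of pairwise Euclidean distances."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        # A pair with equal labels lies inside one partition.
        (intra if labels[i] == labels[j] else inter).append(d)
    return intra, inter


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = split_distances(points, labels)
print("largest intrapartition distance:", max(intra))
print("smallest interpartition distance:", min(inter))
```

For data like this, the pooled distances fall into two separated groups of values, which is the kind of multimodal distance distribution the abstract describes as preferable.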

Comparing distance measures on assessed medical device incident data using Average Silhouette Width

Current Directions in Biomedical Engineering ◽

10.1515/cdbme-2018-0126 ◽

2018 ◽

Vol 4 (1) ◽

pp. 525-528

Author(s):

Christian Bayer ◽

Robin Seidel

Keyword(s):

Distance Measure ◽

Data Preprocessing ◽

Study Data ◽

Machine Learning Algorithms ◽

Distance Measures ◽

Free Text ◽

Silhouette Width ◽

Federal Institute ◽

Cluster Density ◽

Incident Reports

Abstract: Many machine learning algorithms depend on the choice of an appropriate similarity or distance measure. Comparing such measures in different domains and on diversely structured data is common, but the comparison is often performed with regard to an algorithm that clusters or classifies the data. In this study, data assessed by experts is analyzed instead. The data is taken from the database of the Federal Institute for Drugs and Medical Devices (BfArM) and represents free text incident reports. The Average Silhouette Width, a cluster density measure, is used to compare the distance measures’ ability to discriminate the data according to the experts’ assessments. The Euclidean distance and four distance measures derived from the Jaccard similarity, the Simple Matching similarity, the Cosine similarity and the Yule similarity are compared on four subsets of this database. The results show that better data preprocessing is necessary, possibly because boilerplate texts are used to write incident reports. These results will also provide the basis for comparing improvements from different methods of data preprocessing in the future.
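To make the evaluation idea concrete, here is a minimal sketch of the Average Silhouette Width computed over expert-assigned groups with a Jaccard-derived distance. The token sets, the group labels, and the function names are hypothetical; the study's actual preprocessing and data are not reproduced.

```python
def jaccard_distance(a, b):
    """Distance derived from the Jaccard similarity of two token sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_silhouette_width(items, labels, distance):
    """Mean silhouette value over all items for a given labelling."""
    scores = []
    for i, item in enumerate(items):
        # Distances from item i to every other item, grouped by label.
        by_label = {}
        for j, other in enumerate(items):
            if j != i:
                by_label.setdefault(labels[j], []).append(distance(item, other))
        own = by_label.pop(labels[i], [])
        if not own or not by_label:
            # Singleton group or a single group overall: use 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                              # mean within-group distance
        b = min(sum(d) / len(d) for d in by_label.values())  # nearest other group
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical incident reports as token sets, with hypothetical expert groups.
reports = [
    {"pump", "alarm", "failed"},
    {"pump", "alarm", "error"},
    {"catheter", "leak", "tip"},
    {"catheter", "leak", "hub"},
]
groups = [0, 0, 1, 1]

asw = average_silhouette_width(reports, groups, jaccard_distance)
print("average silhouette width:", asw)
```

Swapping `jaccard_distance` for another distance function while keeping the labels fixed gives one score per measure, which is the kind of comparison the abstract describes.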

A Dynamic Distance Measure of Picture Fuzzy Sets and Its Application

Symmetry ◽

10.3390/sym13030436 ◽

2021 ◽

Vol 13 (3) ◽

pp. 436

Author(s):

Ruirui Zhao ◽

Minxia Luo ◽

Shenggang Li

Keyword(s):

Fuzzy Sets ◽

Distance Measure ◽

Distance Measures ◽

Numerical Comparison ◽

Mathematical Tool ◽

Practical Applications ◽

Fuzzy Point ◽

Point Operator ◽

The Difference ◽

Picture Fuzzy Sets

Picture fuzzy sets, which are the extension of intuitionistic fuzzy sets, can deal with inconsistent information better in practical applications. A distance measure is an important mathematical tool to calculate the difference degree between picture fuzzy sets. Although some distance measures of picture fuzzy sets have been constructed, there are some unreasonable and counterintuitive cases. The main reason is that the existing distance measures do not or seldom consider the refusal degree of picture fuzzy sets. In order to solve these unreasonable and counterintuitive cases, in this paper, we propose a dynamic distance measure of picture fuzzy sets based on a picture fuzzy point operator. Through a numerical comparison and multi-criteria decision-making problems, we show that the proposed distance measure is reasonable and effective.

Comparison of Different Distance Measures for Cluster Analysis of Tree-Ring Series

Tree-Ring Research ◽

10.3959/2007-2.1 ◽

2008 ◽

Vol 64 (1) ◽

pp. 27-37 ◽

Cited By ~ 9

Author(s):

Ignacio García-González

Keyword(s):

Cluster Analysis ◽

Tree Ring ◽

Distance Measures

Linkage analysis using geographical proximity: a test of the efficacy of distance measures

Journal of Criminological Research Policy and Practice ◽

10.1108/jcrpp-01-2020-0006 ◽

2020 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Shumpei Haginoya ◽

Aiko Hanayama ◽

Tamae Koike

Keyword(s):

Environmental Factors ◽

Euclidean Distance ◽

Distance Measure ◽

Distance Measures ◽

Manhattan Distance ◽

Geographical Proximity ◽

Discrimination Accuracy ◽

Content Type ◽

Shortest Route ◽

The Impact

Purpose: The purpose of this paper was to compare the accuracy of linking crimes using geographical proximity between three distance measures: Euclidean (distance measured by the length of a straight line between two locations), Manhattan (distance obtained by summing north-south distance and east-west distance) and shortest route distances.
Design/methodology/approach: A total of 194 cases committed by 97 serial residential burglars in Aomori Prefecture in Japan between 2004 and 2015 were used in the present study. The Mann–Whitney U test was used to compare linked (two offenses committed by the same offender) and unlinked (two offenses committed by different offenders) pairs for each distance measure. Discrimination accuracy between linked and unlinked crime pairs was evaluated using the area under the receiver operating characteristic curve (AUC).
Findings: The Mann–Whitney U test showed that the distances of the linked pairs were significantly shorter than those of the unlinked pairs for all distance measures. Comparison of the AUCs showed that the shortest route distance achieved significantly higher accuracy than the Euclidean distance, whereas there was no significant difference between the Euclidean and the Manhattan distance or between the Manhattan and the shortest route distance. These findings give partial support to the idea that distance measures taking the impact of environmental factors into consideration might be able to identify a crime series more accurately than Euclidean distances.
Research limitations/implications: Although the results suggested a difference between the Euclidean and the shortest route distance, it was small, and all distance measures resulted in outstanding AUC values, probably because of ceiling effects. Further investigation that makes the same comparison in a narrower area is needed to avoid this potential inflation of discrimination accuracy.
Practical implications: The shortest route distance might contribute to improving the accuracy of crime linkage based on geographical proximity. However, further investigation is needed before the shortest route distance can be recommended in practice. Given that the targeted area in the present study was relatively large, the findings may contribute especially to improving the accuracy of proactive comparative case analysis, which estimates the whole picture of the distribution of serial crimes in a region, by selecting a more effective distance measure.
Social implications: Improved accuracy in linking crimes may assist crime investigations and lead to the earlier arrest of offenders.
Originality/value: The results of the present study provide an initial indication of the efficacy of using distance measures that take environmental factors into account.
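The two coordinate-based distances and the AUC used to compare linked and unlinked pairs can be sketched as follows. The distance values are made up for illustration, the function names are hypothetical, and the shortest route distance is omitted because it requires road network data. The AUC is computed from its standard relation to the Mann–Whitney U statistic: the proportion of (linked, unlinked) pairs in which the linked distance is shorter, with ties counted as one half.

```python
from math import hypot


def euclidean(p, q):
    """Straight-line distance between two (x, y) locations."""
    return hypot(p[0] - q[0], p[1] - q[1])


def manhattan(p, q):
    """Sum of the east-west and north-south distances."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def auc_shorter_is_linked(linked, unlinked):
    """AUC for the rule 'a shorter distance indicates a linked pair'."""
    wins = 0.0
    for l in linked:
        for u in unlinked:
            if l < u:
                wins += 1.0
            elif l == u:
                wins += 0.5  # ties count as half a win
    return wins / (len(linked) * len(unlinked))


# Hypothetical pair distances (arbitrary units), not data from the study.
linked_distances = [1.0, 3.0]
unlinked_distances = [2.0, 4.0]

print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
print(auc_shorter_is_linked(linked_distances, unlinked_distances))
```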

Graph diffusion distance: Properties and efficient computation

PLoS ONE ◽

10.1371/journal.pone.0249624 ◽

2021 ◽

Vol 16 (4) ◽

pp. e0249624

Author(s):

C. B. Scott ◽

Eric Mjolsness

Keyword(s):

Distance Measure ◽

Fixed Time ◽

Graph Laplacian ◽

Upper Bounds ◽

Distance Measures ◽

Distance Metrics ◽

Graph Products ◽

Efficient Computation ◽

New Family ◽

Sparsity Constraints

We define a new family of similarity and distance measures on graphs, and explore their theoretical properties in comparison to conventional distance metrics. These measures are defined by the solution(s) to an optimization problem which attempts to find a map minimizing the discrepancy between two graph Laplacian exponential matrices, under norm-preserving and sparsity constraints. Variants of the distance metric are introduced to consider such optimized maps under sparsity constraints as well as fixed time-scaling between the two Laplacians. The objective function of this optimization is multimodal and has discontinuous slope, and is hence difficult for univariate optimizers to solve. We demonstrate a novel procedure for efficiently calculating these optima for two of our distance measure variants. We present numerical experiments demonstrating that (a) upper bounds of our distance metrics can be used to distinguish between lineages of related graphs; (b) our procedure is faster at finding the required optima, by as much as a factor of 10^3; and (c) the upper bounds satisfy the triangle inequality exactly under some assumptions and approximately under others. We also derive an upper bound for the distance between two graph products, in terms of the distance between the two pairs of factors. Additionally, we present several possible applications, including the construction of infinite “graph limits” by means of Cauchy sequences of graphs related to one another by our distance measure.
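As a heavily simplified sketch of the ingredients named above, the code below builds graph Laplacians, exponentiates them, and measures the discrepancy between the resulting matrices. It assumes two graphs with the same number of nodes, a fixed identity map between them, a single fixed time `t`, the convention L = D - A with kernel exp(-tL), and a plain truncated Taylor series for the matrix exponential. The optimization over maps and time scalings that defines the paper's measures is not attempted, and all function names are hypothetical.

```python
from math import sqrt


def laplacian(adj):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(m, terms=30):
    """Truncated Taylor series for exp(m); adequate only for small matrices."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, m)]  # m^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def heat_kernel(adj, t):
    """exp(-t L), the Laplacian exponential matrix at time t."""
    return expm([[-t * v for v in row] for row in laplacian(adj)])


def fixed_map_discrepancy(adj_a, adj_b, t):
    """Frobenius norm of the kernel difference under the identity node map."""
    ka, kb = heat_kernel(adj_a, t), heat_kernel(adj_b, t)
    n = len(ka)
    return sqrt(sum((ka[i][j] - kb[i][j]) ** 2 for i in range(n) for j in range(n)))


path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # 3-node path graph
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # 3-node cycle

print(fixed_map_discrepancy(path, triangle, 0.5))
```

Because the node map is fixed here rather than optimized, the printed value should be read only as an illustration of the discrepancy term, not as the distance defined in the paper.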

Combining Regression and Clustering for Financial Analysis

Proceedings of the International Conference on Applied Statistics ◽

10.2478/icas-2021-0010 ◽

2020 ◽

Vol 2 (1) ◽

pp. 109-118

Author(s):

Andreea Ioana Chiriac

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Cluster Analysis ◽

Business Processes ◽

Financial Analysis ◽

Two Dimensions ◽

Machine Learning Algorithms ◽

Financial Ratio ◽

Research Perspective ◽

Systems Learning

Abstract: Artificial intelligence is used in business through machine learning algorithms. Machine learning is a part of computer science focused on computer systems learning to perform a specific task without using explicit instructions, relying on patterns and inference instead. Though it might seem like we’ve come a long way in the last ten years, which is true from a research perspective, the adoption of AI among corporations is still relatively low. Over time it has become possible to automate more tasks and business processes than ever before. The benefit of using artificial intelligence is that it does not require programming every step of the process, predicting at each step what could happen and how to resolve it. The algorithms decide for themselves in each case how the problems should be solved, based on the data that is used. I use the Python language to create a synthetic feature vector that allows visualizations in two dimensions for the EBITDA financial ratio. I use the Mean-Square Error to evaluate the success, given the optimal parameters. In this section, I also describe the purpose, goals, and applications of cluster analysis. I cover the basics of cluster analysis and how to perform it, and demonstrate how to use K-Means.
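A minimal sketch of the K-Means procedure and a mean squared error score is given below. It works on one-dimensional values with fixed starting centroids so that the run is deterministic. The values, the starting centroids, and the function names are assumptions for illustration and are unrelated to the financial data used in the paper.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on scalars; returns (centroids, clusters)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[k]
                     for k, c in enumerate(clusters)]
    return centroids, clusters


def mean_squared_error(centroids, clusters):
    """Mean squared distance of every value to its cluster centroid."""
    total = sum((v - centroids[k]) ** 2 for k, c in enumerate(clusters) for v in c)
    count = sum(len(c) for c in clusters)
    return total / count


# Hypothetical values with two obvious groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(values, centroids=[0.0, 5.0])
print(centroids, mean_squared_error(centroids, clusters))
```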

INDUCED AND LOGARITHMIC DISTANCES WITH MULTI-REGION AGGREGATION OPERATORS

Technological and Economic Development of Economy ◽

10.3846/tede.2019.9382 ◽

2019 ◽

Vol 0 (0) ◽

pp. 1-29 ◽

Cited By ~ 3

Author(s):

Víctor G. Alfaro-García ◽

José M. Merigó ◽

Leobardo Plata-Pérez ◽

Gerardo G. Alfaro-Calderón ◽

Anna M. Gil-Lafuente

Keyword(s):

Distance Measure ◽

Distance Measures ◽

Weighted Averaging ◽

Averaging Operators ◽

Heterogeneous Information ◽

Weighting Vector ◽

Choquet Integrals ◽

World Information ◽

Logarithmic Distance ◽

Region Aggregation

This paper introduces the induced ordered weighted logarithmic averaging IOWLAD and multiregion induced ordered weighted logarithmic averaging MR-IOWLAD operators. The distinctive characteristic of these operators lies in the notion of distance measures combined with the complex reordering mechanism of inducing variables and the properties of the logarithmic averaging operators. The main advantage of MR-IOWLAD operators is their design, which is specifically thought to aid in decision-making when a set of diverse regions with different properties must be considered. Moreover, the induced weighting vector and the distance measure mechanisms of the operator allow for the wider modeling of problems, including heterogeneous information and the complex attitudinal character of experts, when aiming for an ideal scenario. Along with analyzing the main properties of the IOWLAD operators, their families and specific cases, we also introduce some extensions, such as the induced generalized ordered weighted averaging IGOWLAD operator and Choquet integrals. We present the induced Choquet logarithmic distance averaging ICLD operator and the generalized induced Choquet logarithmic distance averaging IGCLD operator. Finally, an illustrative example is proposed, including real-world information retrieved from the United Nations World Statistics for global regions.

Techniques for Fast Screening of 3D Heterogeneous Shale Barrier Configurations and Their Impacts on SAGD Chamber Development

SPE Journal ◽

10.2118/199906-pa ◽

2021 ◽

pp. 1-25

Author(s):

Chang Gao ◽

Juliana Y. Leung

Keyword(s):

Distance Measure ◽

Flow Simulation ◽

Training Data ◽

Distance Measures ◽

Data Driven ◽

Data Set ◽

Flow Simulations ◽

Steam Chamber ◽

Reservoir Models ◽

Tracking Model

Summary: The steam-assisted gravity drainage (SAGD) recovery process is strongly impacted by the spatial distributions of heterogeneous shale barriers. Though detailed compositional flow simulators are available for SAGD recovery performance evaluation, the simulation process is usually quite computationally demanding, rendering their use over a large number of reservoir models for assessing the impacts of heterogeneity (uncertainties) impractical. In recent years, data-driven proxies have been widely proposed to reduce the computational effort; nevertheless, the proxy must be trained using a large data set consisting of many flow simulation cases that ideally span the model parameter spaces. The question remains: is there a more efficient way to screen a large number of heterogeneous SAGD models? Such techniques could help to construct a training data set with less redundancy; they can also be used to quickly identify a subset of heterogeneous models for detailed flow simulation. In this work, we formulated two particular distance measures, flow-based and static-based, to quantify the similarity among a set of 3D heterogeneous SAGD models. First, to formulate the flow-based distance measure, a physics-based particle-tracking model is used: Darcy’s law and energy balance are integrated to mimic the steam chamber expansion process; steam particles that are located at the edge of the chamber release their energy to the surrounding cold bitumen, while detailed fluid displacements are not explicitly simulated. The steam chamber evolution is modeled, and a flow-based distance between two given reservoir models is defined as the difference in their chamber sizes over time. Second, to formulate the static-based distance, the Hausdorff distance (Hausdorff 1914) is used: it is often used in image processing to compare two images according to the spatial arrangement and shapes of the objects they contain.
A suite of 3D models is constructed using representative petrophysical properties and operating constraints extracted from several pads in Suncor Energy’s Firebag project. The computed distance measures are used to partition the models into different groups. To establish a baseline for comparison, flow simulations are performed on these models to predict the actual chamber evolution and production profiles. The grouping results according to the proposed flow- and static-based distance measures match reasonably well those obtained from detailed flow simulations. Significant improvement in computational efficiency is achieved with the proposed techniques. They can be used to efficiently screen a large number of reservoir models and facilitate the clustering of these models into groups with distinct shale heterogeneity characteristics. They also have significant potential to be integrated with other data-driven approaches for reducing the computational load typically associated with detailed flow simulations involving multiple heterogeneous reservoir realizations.
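The Hausdorff distance mentioned for the static-based measure can be sketched for two finite point sets as follows. The point sets are hypothetical stand-ins for shale barrier cell locations, and the function names are assumptions; how the paper maps reservoir models to point sets is not reproduced here.

```python
from math import dist


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical 2D cell locations for two models.
model_a = [(0.0, 0.0), (1.0, 0.0)]
model_b = [(0.0, 0.0), (4.0, 0.0)]

print(directed_hausdorff(model_a, model_b))  # every point of a has a close match in b
print(hausdorff(model_a, model_b))           # the far point (4, 0) in b dominates
```

The two directed values differ, which is why the symmetric form takes the maximum of both directions.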

Enhanced Bio-Inspired Algorithms for Detecting and Filtering Spam

Global Implications of Emerging Technology Trends - Advances in IT Standards and Standardization Research ◽

10.4018/978-1-5225-4944-4.ch011 ◽

2018 ◽

pp. 179-215

Author(s):

Hadj Ahmed Bouarara

Keyword(s):

Electronic Communication ◽

Distance Measure ◽

Machine Learning Algorithms ◽

Spam Detection ◽

Social Bees ◽

Representation Technique ◽

Sensitive Parameters ◽

Measure Entropy ◽

Email Spam ◽

F Measure

The internet era promotes electronic commerce and facilitates access to many services. In today's digital society, the explosion in communication has revolutionized the field of electronic communication. Unfortunately, this technology has unquestionably become a source of malicious activities, especially the plague of undesirable email (SPAM), which has grown tremendously in the last few years. This chapter unveils fresh bio-inspired techniques (artificial social cockroaches [ASC], artificial haemostasis system [AHS], and artificial heart lungs system [AHLS]) and their application to SPAM detection. For the experimentation, the authors used the benchmark SMS Spam corpus V.0.1 and the validation measures (recall, precision, f-measure, entropy, accuracy, and error). They optimize the sensitive parameters of each algorithm (text representation technique, distance measure, weightings, and threshold). The results are positive compared to the results of artificial social bees and machine-learning algorithms (decision tree C4.5 and K-means).
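Several of the validation measures listed above follow directly from the counts of a binary confusion matrix. The sketch below shows precision, recall, F-measure, accuracy, and error under their standard definitions; the counts are invented for illustration and are not results from the chapter, and the function name is hypothetical.

```python
def validation_measures(tp, fp, fn, tn):
    """Standard binary classification measures from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
    }


# Hypothetical counts: spam is the positive class.
measures = validation_measures(tp=8, fp=2, fn=2, tn=88)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```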

The Influence of Geographic and Psychic Distance on Online Hotel Ratings

Journal of Travel Research ◽

10.1177/0047287519858400 ◽

2019 ◽

Vol 59 (4) ◽

pp. 722-741 ◽

Cited By ~ 2

Author(s):

Paul Phillips ◽

Nuno Antonio ◽

Ana de Almeida ◽

Luís Nunes

Keyword(s):

Text Mining ◽

Distance Measure ◽

Country Of Origin ◽

Geographic Distance ◽

Distance Measures ◽

Review Author ◽

Psychic Distance ◽

Rating Score ◽

Data Set ◽

The Relationship

This study examines the relationship between distance measures and online hotel ratings, using a Portuguese data set of 34,622 online hotel reviews extracted from Booking.com and TripAdvisor and written in Portuguese, Spanish, and English. Based on the country of origin of each review author, a geographic and a psychic distance measure to Portugal are calculated. Data and text mining analysis provides additional insights into online hotel ratings. The authors confirm that online travelers’ evaluations are multifaceted constructs displaying varying patterns of rating behavior among the traveler base. By investigating the contemporary relevance of geographic and psychic distance, a key finding of this study is that travelers with less distance, in terms of both psychic and geographic distance, give a lower rating score than travelers with greater distance. The inclusion of psychic and geographic distance is advocated as a salient aspect for future researchers and for those practitioners who wish to enhance hotel product and service features.
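The abstract does not say how the geographic distance was computed. One common choice, shown here purely as an assumption, is the great-circle (haversine) distance between two reference points such as capital cities. The coordinates, the Earth radius constant, and the function name are illustrative and are not taken from the study.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a conventional approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Approximate coordinates, used only as an example.
lisbon = (38.7223, -9.1393)
madrid = (40.4168, -3.7038)

print(round(haversine_km(*lisbon, *madrid)), "km")
```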
