Distributed Mining of Outliers from Large Multi-Dimensional Databases

A data point in a given dataset is considered an outlier when it is distant from its nearest neighbours, so the definition rests on a distance measure. Detecting outliers in distributed environments is challenging, however. Many approaches to mining outliers in such environments have been proposed, but a faster and more efficient method is still desired. In this paper we employ a novel index tree that is hierarchical in nature. Its hierarchical structure enables space pruning, while its clustering property speeds up the search for the neighbours of a given data point. Its time complexity is linear in the size of the dataset and in its number of dimensions. On top of the hierarchical tree (Hi-tree), nearest-neighbour search avoids unnecessary computations and prunes unpromising points. We propose an algorithm named Distributed Mining of Outliers using Hi-tree (DMOH). The index tree also lends itself to parallel processing. We built a prototype application as a proof of concept. Our empirical study demonstrated the efficiency of the proposed algorithm on top of Hi-tree.
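
As a rough illustration of the distance-based definition above, the following sketch scores each point by the distance to its k-th nearest neighbour and reports the highest-scoring points as outliers. It is a brute-force baseline with quadratic cost, not the Hi-tree index or the DMOH algorithm, whose details the abstract does not give; the function names and the sample data are assumptions made for this example.

```python
import math

def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Brute force: compare against every other point (O(n^2) overall).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_outliers(points, k, n):
    """Return the indices of the n points with the largest k-NN distance."""
    scores = knn_distance_scores(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]

# Hypothetical data: a tight cluster plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(points, k=2, n=1))  # prints [4]
```

An index structure such as the one described in the abstract would aim to avoid most of the pairwise distance computations performed here.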

A computationally efficient estimator for mutual information

Proceedings of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rspa.2007.0196 ◽

2008 ◽

Vol 464 (2093) ◽

pp. 1203-1215 ◽

Cited By ~ 16

Author(s):

Dafydd Evans

Keyword(s):

Data Analysis ◽

Mutual Information ◽

Time Complexity ◽

Exploratory Data Analysis ◽

Nearest Neighbour ◽

Computationally Efficient ◽

One Dimensional ◽

Exploratory Data ◽

Efficient Alternative ◽

Computationally Expensive

Mutual information quantifies the determinism that exists in a relationship between random variables, and thus plays an important role in exploratory data analysis. We investigate a class of non-parametric estimators for mutual information, based on the nearest neighbour structure of observations in both the joint and marginal spaces. Unless both marginal spaces are one-dimensional, we demonstrate that a well-known estimator of this type can be computationally expensive under certain conditions, and propose a computationally efficient alternative that has a time complexity of order O(N log N) as the number of observations N → ∞.

Extension of mathematical background for Nearest Neighbour Analysis in three-dimensional space

Geoinformatics FCE CTU ◽

10.14311/gi.11.2 ◽

2013 ◽

Vol 11 ◽

pp. 25-36

Author(s):

Eva Stopková

Keyword(s):

Dimensional Space ◽

Average Distance ◽

Three Dimensional ◽

Nearest Neighbour ◽

Mathematical Background ◽

Area Of Interest ◽

Nearest Neighbours ◽

Anisotropic Function ◽

Neighbour Analysis ◽

Three Dimensional Space

This paper deals with the development and testing of a module for GRASS GIS [1] based on Nearest Neighbour Analysis. The method is useful for assessing whether points located in an area of interest are distributed randomly, in clusters, or separately. Its main principle is to compare the observed average distance between nearest neighbours, r_A, with the average distance between nearest neighbours, r_E, that is expected for randomly distributed points. The result should be statistically tested. The methods for two- and three-dimensional space differ in how r_E is computed. The paper also describes an extension of the mathematical background, deriving the standard deviation of r_E, which is needed in the statistical test of the analysis result. As the disposition of phenomena (e.g. the distribution of birds’ nests or plant species) and the test results suggest, an anisotropic function would represent relationships between points in three-dimensional space better than the isotropic function used in this work.
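
The comparison of observed and expected nearest-neighbour distances can be sketched as follows for the two-dimensional case. The sketch assumes the classical expectation for a random pattern of density n/A, namely r_E = 1 / (2 * sqrt(n / A)); it omits edge correction and the statistical test, and it is not the code of the GRASS GIS module. The function name and the sample points are hypothetical.

```python
import math

def nearest_neighbour_ratio(points, area):
    """Return R = r_A / r_E for a 2D point pattern within a region of given area."""
    n = len(points)
    # Observed mean distance from each point to its nearest neighbour (r_A).
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    r_a = sum(nearest) / n
    # Expected mean distance for a random pattern of the same density (r_E).
    density = n / area
    r_e = 1.0 / (2.0 * math.sqrt(density))
    return r_a / r_e

# Hypothetical pattern: four evenly spaced points in a region of area 4.
grid = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(nearest_neighbour_ratio(grid, area=4.0))  # prints 2.0
```

Values of R well below 1 suggest clustering, values near 1 a random arrangement, and values above 1 a dispersed arrangement.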

The Perception of Minimal Structures: Performance on Open and Closed Versions of Visually Presented Euclidean Travelling Salesperson Problems

Perception ◽

10.1068/p3416 ◽

2003 ◽

Vol 32 (7) ◽

pp. 871-886 ◽

Cited By ~ 22

Author(s):

Douglas Vickers ◽

Pierre Bovet ◽

Michael D Lee ◽

Peter Hughes

Keyword(s):

Minimal Length ◽

Nearest Neighbour ◽

Optimal Solutions ◽

Response Uncertainty ◽

Structure And Motion ◽

Nearest Neighbours ◽

And Performance ◽

Computational Intractability ◽

Travelling Salesperson Problem

The planar Euclidean version of the travelling salesperson problem (TSP) requires finding a tour of minimal length through a two-dimensional set of nodes. Despite the computational intractability of the TSP, people can produce rapid, near-optimal solutions to visually presented versions of such problems. To explain this, MacGregor et al (1999, Perception 28 1417–1428) have suggested that people use a global-to-local process, based on a perceptual tendency to organise stimuli into convex figures. We review the evidence for this idea and propose an alternative, local-to-global hypothesis, based on the detection of least distances between the nodes in an array. We present the results of an experiment in which we examined the relationships between three objective measures and performance measures of optimality and response uncertainty in tasks requiring participants to construct a closed tour or an open path. The data are not well accounted for by a process based on the convex hull. In contrast, results are generally consistent with a locally focused process based initially on the detection of nearest-neighbour clusters. Individual differences are interpreted in terms of a hierarchical process of constructing solutions, and the findings are related to a more general analysis of the role of nearest neighbours in the perception of structure and motion.
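
For readers unfamiliar with the distinction between a closed tour and an open path, the sketch below builds a visiting order with the simple greedy nearest-neighbour heuristic and measures its length in both forms. This is a standard textbook heuristic chosen for illustration; it is not the authors' perceptual model, and the function names and node coordinates are assumptions.

```python
import math

def nearest_neighbour_path(nodes, start=0):
    """Greedily visit the closest unvisited node; ties go to the lower index."""
    unvisited = set(range(len(nodes)))
    unvisited.remove(start)
    order = [start]
    while unvisited:
        last = nodes[order[-1]]
        nxt = min(unvisited, key=lambda i: (math.dist(last, nodes[i]), i))
        unvisited.remove(nxt)
        order.append(nxt)
    return order

def path_length(nodes, order, closed):
    """Total length of the path; a closed tour also returns to the first node."""
    successors = order[1:] + ([order[0]] if closed else [])
    return sum(math.dist(nodes[a], nodes[b]) for a, b in zip(order, successors))

# Hypothetical nodes placed on a line.
nodes = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
order = nearest_neighbour_path(nodes)
print(order, path_length(nodes, order, closed=False), path_length(nodes, order, closed=True))
```

The greedy heuristic gives no optimality guarantee; it only makes concrete how a purely local, least-distance rule can produce a complete solution.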

Nearest-Neighbour-Induced Isolation Similarity and Its Impact on Density-Based Clustering

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014755 ◽

2019 ◽

Vol 33 ◽

pp. 4755-4762 ◽

Cited By ~ 3

Author(s):

Xiaoyu Qin ◽

Kai Ming Ting ◽

Ye Zhu ◽

Vincent CS Lee

Keyword(s):

Clustering Algorithm ◽

Distance Measure ◽

Nearest Neighbour ◽

Density Peak ◽

Density Based Clustering ◽

New Type ◽

Density Peak Clustering ◽

The Impact ◽

First Time ◽

Tree Method

A recent proposal of data dependent similarity called Isolation Kernel/Similarity has enabled SVM to produce better classification accuracy. We identify shortcomings of using a tree method to implement Isolation Similarity; and propose a nearest neighbour method instead. We formally prove the characteristic of Isolation Similarity with the use of the proposed method. The impact of Isolation Similarity on density-based clustering is studied here. We show for the first time that the clustering performance of the classic density-based clustering algorithm DBSCAN can be significantly uplifted to surpass that of the recent density-peak clustering algorithm DP. This is achieved by simply replacing the distance measure with the proposed nearest-neighbour-induced Isolation Similarity in DBSCAN, leaving the rest of the procedure unchanged. A new type of clusters called mass-connected clusters is formally defined. We show that DBSCAN, which detects density-connected clusters, becomes one which detects mass-connected clusters, when the distance measure is replaced with the proposed similarity. We also provide the condition under which mass-connected clusters can be detected, while density-connected clusters cannot.

Query-specific signature selection for efficient k-nearest neighbour approximation

Journal of Information Science ◽

10.1177/0165551516644176 ◽

2016 ◽

Vol 43 (4) ◽

pp. 440-457

Author(s):

Youngki Park ◽

Heasoo Hwang ◽

Sang-goo Lee

Keyword(s):

Computational Cost ◽

Optimization Techniques ◽

Locality Sensitive Hashing ◽

High Dimensional ◽

Query Point ◽

Nearest Neighbour ◽

Multiple Queries ◽

Large Pool ◽

Nearest Neighbours ◽

Selection For

Finding k-nearest neighbours (k-NN) is one of the most important primitives of many applications such as search engines and recommendation systems. However, its computational cost is extremely high when searching for k-NN points in a huge collection of high-dimensional points. Locality-sensitive hashing (LSH) has been introduced for an efficient k-NN approximation, but none of the existing LSH approaches clearly outperforms others. We propose a novel LSH approach, Signature Selection LSH (S2LSH), which finds approximate k-NN points very efficiently in various datasets. It first constructs a large pool of highly diversified signature regions with various sizes. Given a query point, it dynamically generates a query-specific signature region by merging highly effective signature regions selected from the signature pool. We also suggest S2LSH-M, a variant of S2LSH, which processes multiple queries more efficiently by using query-specific features and optimization techniques. Extensive experiments show the performance superiority of our approaches in diverse settings.
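
To make the general idea of LSH-based k-NN approximation concrete, the sketch below uses plain random-hyperplane hashing: points whose sign pattern against a set of random hyperplanes matches the query's land in the same bucket, and only those candidates are ranked exactly. This is a generic baseline and not S2LSH or its signature-pool construction, which the abstract does not specify; all names, the seed, and the data are assumptions.

```python
import math
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(sum(v * w for v, w in zip(vec, plane)) >= 0.0 for plane in planes)

def build_index(data, planes):
    """Group point indices into buckets keyed by signature."""
    buckets = defaultdict(list)
    for i, vec in enumerate(data):
        buckets[signature(vec, planes)].append(i)
    return buckets

def approx_knn(query, data, buckets, planes, k):
    """Rank only the points sharing the query's bucket by exact distance."""
    candidates = buckets.get(signature(query, planes), [])
    return sorted(candidates, key=lambda i: math.dist(query, data[i]))[:k]

# Hypothetical data set of distinct 3-dimensional points.
data = [(1.0, 0.2, 0.1), (0.9, 0.3, 0.0), (-1.0, 0.5, 0.4), (0.1, -1.0, 0.8)]
planes = make_hyperplanes(dim=3, n_bits=4)
buckets = build_index(data, planes)
print(approx_knn(data[0], data, buckets, planes, k=2))
```

A single hash table can miss true neighbours that fall into other buckets; practical schemes usually combine several tables or, as the abstract describes, adapt the hashed regions to the query.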

A law of large numbers for nearest neighbour statistics

Proceedings of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rspa.2008.0235 ◽

2008 ◽

Vol 464 (2100) ◽

pp. 3175-3192 ◽

Cited By ~ 13

Author(s):

Dafydd Evans

Keyword(s):

Law Of Large Numbers ◽

Random Variables ◽

Nearest Neighbour ◽

Near Neighbour ◽

Strong Law ◽

Large Numbers ◽

Nearest Neighbours ◽

Expected Values ◽

Sample Points ◽

Data Analysis Methods

In practical data analysis, methods based on proximity (near-neighbour) relationships between sample points are important because these relations can be computed in time O(n log n) as the number of points n → ∞. Associated with such methods is a class of random variables defined to be functions of a given point and its nearest neighbours in the sample. If the sample points are independent and identically distributed, the associated random variables will also be identically distributed but not independent. Despite this, we show that random variables of this type satisfy a strong law of large numbers, in the sense that their sample means converge to their expected values almost surely as the number of sample points n → ∞.

A Hierarchical Tree Distance Measure for Classification

Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods ◽

10.5220/0006198505020509 ◽

2017 ◽

Author(s):

Kent Munthe Caspersen ◽

Martin Bjeldbak Madsen ◽

Andreas Berre Eriksen ◽

Bo Thiesson

Keyword(s):

Distance Measure ◽

Hierarchical Tree ◽

Tree Distance

Influence of planting seed tubers with gangrene (Phoma foveata) and of neighbouring healthy, diseased and missing plants on the yield and size of potatoes

The Journal of Agricultural Science ◽

10.1017/s0021859600074499 ◽

1995 ◽

Vol 125 (1) ◽

pp. 51-60 ◽

Cited By ~ 8

Author(s):

G. A. Hide ◽

S. J. Welham ◽

P. J. Read ◽

A. E. Ainsley

Keyword(s):

Experimental Design ◽

Total Yield ◽

Tuber Size ◽

Tuber Number ◽

Potato Seed ◽

Size Distributions ◽

Nearest Neighbour ◽

Seed Tubers ◽

Nearest Neighbours ◽

Phoma Foveata

SUMMARY: Potato seed tubers infected or not infected with gangrene (Phoma foveata) were planted at Rothamsted in 1987 to measure the effect of the disease and of neighbouring plants on yield. The experimental design was constructed so that the effect on growth of six adjacent plants (two nearest neighbours in each direction within rows and one nearest neighbour in each direction across rows) could be estimated for each plant. Total yield, ware (> 150 g) yield and tuber number from individual plants were affected most by the disease but also, in decreasing importance, by the two plants on either side within the same row (first neighbours), the two plants adjacent to the first neighbours (second neighbours) and the two adjacent plants in the rows on either side. Yield and tuber numbers increased as the different combinations of neighbouring plants contained increasing proportions of plants from diseased seed and missing plants; plants compensated for decreasing competition. Tuber size distributions showed that numbers of ware tubers decreased with increasing competition whereas numbers of small tubers were less affected. The fitted model was used to predict yields from crops planted with different proportions of diseased or missing seed tubers.

Cryptocurrency Price Prediction Using the K-Nearest Neighbours Method

Jurnal Pilar Nusa Mandiri ◽

10.33480/pilar.v14i2.894 ◽

2018 ◽

Vol 14 (2) ◽

pp. 137

Author(s):

Haerul Fatah ◽

Agus Subekti

Keyword(s):

Mean Squared Error ◽

Mean Absolute Error ◽

Search Algorithm ◽

Absolute Error ◽

Nearest Neighbour ◽

Root Mean Squared Error ◽

Squared Error ◽

Nearest Neighbours

Electronic money has become an increasingly popular choice for many people, especially entrepreneurs, business people and investors, who believe that electronic money will replace physical money in the future. Cryptocurrency emerged as an answer to the limitations of electronic money, which depends heavily on third parties. One type of cryptocurrency is Bitcoin. Financially, Bitcoin is analogous to the stock market in that its price fluctuates unpredictably from second to second. The aim of this study is to predict cryptocurrency prices using the KNN (K-Nearest Neighbours) method. The results show that the best KNN model for predicting cryptocurrency prices is KNN with parameter K=3 and the nearest neighbour search algorithm Linear NN Search, with a Mean Absolute Error (MAE) of 0.0018 and a Root Mean Squared Error (RMSE) of 0.0089.
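
The kind of model described above can be sketched as a K-nearest-neighbours regressor with a linear (exhaustive) neighbour search, together with the two reported error measures. The abstract does not state which input features or data were used, so the one-dimensional feature, the training values, and the function names below are purely hypothetical.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Linear search: rank every training point by its distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical training data: one feature per sample and its target value.
train_x = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 10.0]
prediction = knn_predict(train_x, train_y, (2.0,), k=3)
print(prediction, mae([2.5], [prediction]), rmse([2.5], [prediction]))
```

RMSE penalises large individual errors more heavily than MAE, which is why the two measures are often reported side by side.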

The importance of population density and inter-individual distance in conserving the European oyster Ostrea edulis

Journal of the Marine Biological Association of the United Kingdom ◽

10.1017/s0025315418000395 ◽

2018 ◽

Vol 99 (3) ◽

pp. 587-593 ◽

Cited By ~ 5

Author(s):

C. Guy ◽

D. Smyth ◽

D. Roberts

Keyword(s):

Brood Size ◽

Fertilization Success ◽

Nearest Neighbour ◽

Allee Effects ◽

Ostrea Edulis ◽

Model Species ◽

Nearest Neighbours ◽

Broadcast Spawner ◽

Individual Distance ◽

Wet Weight

Fertilization success will determine the rate at which a population can expand and is especially important when considering small, establishing or enduring communities. Introduced species frequently fail to establish reproductively functional populations due to strong Allee effects associated with low densities. The native European oyster, Ostrea edulis, broods its fertilized eggs in the pallial cavity for a period of 8–10 days before releasing the larvae. It is considered a partial broadcast spawner and was used as a model species to assess the importance of Allee effects such as inter-individual distance on reproductive success. Distances between individual oysters within test plots in areas of known oyster density were used in conjunction with standardized brood size (n larvae g−1 total wet weight) to assess fertilization success. A significant, positive relationship was observed between brood size and oyster density. Oysters with a nearest neighbour ≤1.5 m were found to brood significantly more larvae than individuals with nearest neighbours ≥1.5 m. Therefore, high density sites need to be maintained to ensure the recovery and enhancement of this OSPAR Convention recognized species in decline.
