Distributed Mining of Outliers from Large Multi-Dimensional Databases

2018 ◽  
Vol 7 (4.7) ◽  
pp. 292
Author(s):  
K. Ashesh ◽  
Dr. G. Appa Rao

A data point in a given dataset is considered an outlier when it is distant from its nearest neighbours, so the definition rests on a distance measure. Detecting such outliers is challenging in distributed environments. Many approaches to mining outliers in such environments have been proposed, but a faster and more efficient method is still desirable. In this paper we employ a novel index tree that is hierarchical in nature. Its hierarchical structure enables space pruning, while its clustering property speeds up the search for the neighbours of a given data point. Its time complexity is linear in the size of the dataset and in its dimensionality. On top of the hierarchical tree (Hi-tree), nearest neighbour search avoids unnecessary computations and prunes unpromising points. We propose an algorithm named Distributed Mining of Outliers using Hi-tree (DMOH). The index tree can also be exploited with parallel processing. We built a prototype application as a proof of concept. Our empirical study demonstrated the efficiency of the proposed algorithm on top of the Hi-tree.
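
The distance-based view of outliers described above can be illustrated with a minimal brute-force sketch. It is not the DMOH algorithm and does not use a Hi-tree; it simply scores each point by the distance to its k-th nearest neighbour. The function names, the parameter `k`, and the toy data are assumptions made for illustration.

```python
import math

def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbour.

    Brute force, O(n^2) distance computations; an index tree would be
    used to prune most of these in a practical system.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_outliers(points, k, n):
    """Return the indices of the n points with the largest scores."""
    scores = knn_outlier_scores(points, k)
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return order[:n]

# Hypothetical toy data: four clustered points and one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(data, k=2, n=1))
```

The quadratic cost of this baseline is the kind of overhead that pruning with an index structure is meant to avoid.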

Author(s):  
Dafydd Evans

Mutual information quantifies the determinism that exists in a relationship between random variables, and thus plays an important role in exploratory data analysis. We investigate a class of non-parametric estimators for mutual information, based on the nearest neighbour structure of observations in both the joint and marginal spaces. Unless both marginal spaces are one-dimensional, we demonstrate that a well-known estimator of this type can be computationally expensive under certain conditions, and propose a computationally efficient alternative that has a time complexity of order O(N log N) as the number of observations N → ∞.


2013 ◽  
Vol 11 ◽  
pp. 25-36
Author(s):  
Eva Stopková

This paper deals with the development and testing of a module for GRASS GIS [1] based on Nearest Neighbour Analysis. The method is useful for assessing whether points located in an area of interest are distributed randomly, in clusters, or separately. Its main principle is to compare the observed average distance between nearest neighbours, r_A, with the average distance between nearest neighbours, r_E, that would be expected if the points were randomly distributed. The result should be statistically tested. The methods for two- and three-dimensional space differ in how r_E is computed. The paper also describes an extension of the mathematical background, deriving the standard deviation of r_E, which is needed for the statistical test of the analysis result. As the disposition of phenomena (e.g. the distribution of birds' nests or plant species) and the test results suggest, an anisotropic function would represent relationships between points in three-dimensional space better than the isotropic function used in this work.
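
The comparison of the observed and expected nearest-neighbour distances can be sketched for the two-dimensional case. This sketch assumes the classic Clark and Evans formulation, in which the expected distance is 1 / (2 * sqrt(density)) and its standard error is 0.26136 / sqrt(n * density); it is not the GRASS GIS module itself, it applies no edge correction, and the function name and toy points are hypothetical.

```python
import math

def nearest_neighbour_index(points, area):
    """Return (R, z) for a 2D point pattern, where R = r_A / r_E.

    R < 1 suggests clustering, R close to 1 randomness, R > 1 dispersion.
    z is the standard normal test statistic for the difference r_A - r_E.
    """
    n = len(points)
    nn_dists = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    r_a = sum(nn_dists) / n                   # observed mean NN distance
    density = n / area                        # points per unit area
    r_e = 1.0 / (2.0 * math.sqrt(density))    # expected under randomness
    se = 0.26136 / math.sqrt(n * density)     # standard error of r_E
    z = (r_a - r_e) / se
    return r_a / r_e, z

# Hypothetical example: a regular 4 x 4 grid inside a 4 x 4 study area.
grid = [(x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
ratio, z_score = nearest_neighbour_index(grid, area=16.0)
print(ratio, z_score)
```

A regular grid gives a ratio well above 1, whereas tightly clustered points would give a ratio close to 0.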


Perception ◽  
10.1068/p3416 ◽  
2003 ◽  
Vol 32 (7) ◽  
pp. 871-886 ◽  
Author(s):  
Douglas Vickers ◽  
Pierre Bovet ◽  
Michael D Lee ◽  
Peter Hughes

The planar Euclidean version of the travelling salesperson problem (TSP) requires finding a tour of minimal length through a two-dimensional set of nodes. Despite the computational intractability of the TSP, people can produce rapid, near-optimal solutions to visually presented versions of such problems. To explain this, MacGregor et al (1999, Perception 28 1417–1428) have suggested that people use a global-to-local process, based on a perceptual tendency to organise stimuli into convex figures. We review the evidence for this idea and propose an alternative, local-to-global hypothesis, based on the detection of least distances between the nodes in an array. We present the results of an experiment in which we examined the relationships between three objective measures and performance measures of optimality and response uncertainty in tasks requiring participants to construct a closed tour or an open path. The data are not well accounted for by a process based on the convex hull. In contrast, results are generally consistent with a locally focused process based initially on the detection of nearest-neighbour clusters. Individual differences are interpreted in terms of a hierarchical process of constructing solutions, and the findings are related to a more general analysis of the role of nearest neighbours in the perception of structure and motion.


Author(s):  
Xiaoyu Qin ◽  
Kai Ming Ting ◽  
Ye Zhu ◽  
Vincent CS Lee

A recent proposal of data-dependent similarity called Isolation Kernel/Similarity has enabled SVM to produce better classification accuracy. We identify shortcomings of using a tree method to implement Isolation Similarity, and propose a nearest neighbour method instead. We formally prove the characteristic of Isolation Similarity with the use of the proposed method. The impact of Isolation Similarity on density-based clustering is studied here. We show for the first time that the clustering performance of the classic density-based clustering algorithm DBSCAN can be significantly uplifted to surpass that of the recent density-peak clustering algorithm DP. This is achieved by simply replacing the distance measure with the proposed nearest-neighbour-induced Isolation Similarity in DBSCAN, leaving the rest of the procedure unchanged. A new type of cluster called a mass-connected cluster is formally defined. We show that DBSCAN, which detects density-connected clusters, becomes one which detects mass-connected clusters, when the distance measure is replaced with the proposed similarity. We also provide the condition under which mass-connected clusters can be detected, while density-connected clusters cannot.


2016 ◽  
Vol 43 (4) ◽  
pp. 440-457
Author(s):  
Youngki Park ◽  
Heasoo Hwang ◽  
Sang-goo Lee

Finding k-nearest neighbours (k-NN) is one of the most important primitives of many applications such as search engines and recommendation systems. However, its computational cost is extremely high when searching for k-NN points in a huge collection of high-dimensional points. Locality-sensitive hashing (LSH) has been introduced for an efficient k-NN approximation, but none of the existing LSH approaches clearly outperforms others. We propose a novel LSH approach, Signature Selection LSH (S2LSH), which finds approximate k-NN points very efficiently in various datasets. It first constructs a large pool of highly diversified signature regions with various sizes. Given a query point, it dynamically generates a query-specific signature region by merging highly effective signature regions selected from the signature pool. We also suggest S2LSH-M, a variant of S2LSH, which processes multiple queries more efficiently by using query-specific features and optimization techniques. Extensive experiments show the performance superiority of our approaches in diverse settings.
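
The general idea of LSH for approximate k-NN search can be sketched as follows. This is not S2LSH: it substitutes plain random-hyperplane hashing with a single hash table, where points whose signatures match the query's are treated as candidates and then re-ranked by exact Euclidean distance. All names, the number of bits, and the seed are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(a * b for a, b in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def build_index(vectors, planes):
    """Group vector indices into buckets keyed by signature."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, planes)].append(idx)
    return buckets

def query(q, vectors, buckets, planes, k):
    """Return up to k candidates from q's bucket, nearest first."""
    candidates = buckets.get(signature(q, planes), [])
    return sorted(candidates, key=lambda i: math.dist(q, vectors[i]))[:k]

# Hypothetical toy data in three dimensions.
vectors = [(1.0, 2.0, 3.0), (2.0, 4.0, 6.0), (-1.0, -2.0, -3.0), (3.0, -1.0, 0.5)]
planes = make_hyperplanes(num_bits=4, dim=3)
index = build_index(vectors, planes)
print(query(vectors[0], vectors, index, planes, k=2))
```

Only the query's own bucket is searched, so true neighbours that fall in other buckets are missed; practical schemes use several tables or, as in the abstract, adapt the searched region to the query.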


Author(s):  
Dafydd Evans

In practical data analysis, methods based on proximity (near-neighbour) relationships between sample points are important because these relations can be computed in time O(n log n) as the number of points n → ∞. Associated with such methods is a class of random variables defined to be functions of a given point and its nearest neighbours in the sample. If the sample points are independent and identically distributed, the associated random variables will also be identically distributed but not independent. Despite this, we show that random variables of this type satisfy a strong law of large numbers, in the sense that their sample means converge to their expected values almost surely as the number of sample points n → ∞.


1995 ◽  
Vol 125 (1) ◽  
pp. 51-60 ◽  
Author(s):  
G. A. Hide ◽  
S. J. Welham ◽  
P. J. Read ◽  
A. E. Ainsley

SUMMARY: Potato seed tubers infected or not infected with gangrene (Phoma foveata) were planted at Rothamsted in 1987 to measure the effect of the disease and of neighbouring plants on yield. The experimental design was constructed so that the effect on growth of six adjacent plants (two nearest neighbours in each direction within rows and one nearest neighbour in each direction across rows) could be estimated for each plant. Total yield, ware (> 150 g) yield and tuber number from individual plants were affected most by the disease but also, in decreasing importance, by the two plants on either side within the same row (first neighbours), the two plants adjacent to the first neighbours (second neighbours) and the two adjacent plants in the rows on either side. Yield and tuber numbers increased as the different combinations of neighbouring plants contained increasing proportions of plants from diseased seed and missing plants; plants compensated for decreasing competition. Tuber size distributions showed that numbers of ware tubers decreased with increasing competition whereas numbers of small tubers were less affected. The fitted model was used to predict yields from crops planted with different proportions of diseased or missing seed tubers.


2018 ◽  
Vol 14 (2) ◽  
pp. 137
Author(s):  
Haerul Fatah ◽  
Agus Subekti

Electronic money has become an option that is increasingly used by many people, especially entrepreneurs, businesspeople and investors, who believe that electronic money will replace physical money in the future. Cryptocurrency emerged as an answer to a limitation of electronic money, namely its heavy dependence on third parties. One type of cryptocurrency is Bitcoin. Bitcoin's financial behaviour is analogous to that of the stock market, in that its price fluctuates unpredictably every second. The aim of this research is to predict cryptocurrency prices using the KNN (K-Nearest Neighbours) method. The results show that the best KNN model for predicting cryptocurrency prices is KNN with parameter K=3 and Nearest Neighbour Search Algorithm: Linear NN Search, with a Mean Absolute Error (MAE) of 0.0018 and a Root Mean Squared Error (RMSE) of 0.0089.
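
A minimal sketch of KNN regression with a linear (exhaustive) neighbour search, together with the MAE and RMSE error measures, is given here. It assumes unweighted averaging of the K nearest targets and Euclidean distance; the function names and the toy data are hypothetical and are not the dataset or software used in the study.

```python
import math

def knn_predict(train_x, train_y, x, k=3):
    """Predict by averaging the targets of the k nearest training points.

    Neighbours are found by a linear scan over all training points.
    """
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

def mae(actual, predicted):
    """Mean Absolute Error: the mean of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: sqrt of the mean of (actual - predicted)^2."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical one-feature toy data (for example, previous price -> next price).
train_x = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 10.0]
test_x = [(2.0,), (9.0,)]
test_y = [2.0, 9.0]

predictions = [knn_predict(train_x, train_y, x, k=3) for x in test_x]
print(predictions, mae(test_y, predictions), rmse(test_y, predictions))
```

RMSE penalises large errors more heavily than MAE, which is why the two values are usually reported together.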


Author(s):  
C. Guy ◽  
D. Smyth ◽  
D. Roberts

Fertilization success will determine the rate at which a population can expand and is especially important when considering small, establishing or enduring communities. Introduced species frequently fail to establish reproductively functional populations due to strong Allee effects associated with low densities. The native European oyster, Ostrea edulis, broods its fertilized eggs in the pallial cavity for a period of 8–10 days before releasing the larvae. It is considered a partial broadcast spawner and was used as a model species to assess the importance of Allee effects such as inter-individual distance on reproductive success. Distances between individual oysters within test plots in areas of known oyster density were used in conjunction with standardized brood size (n larvae g⁻¹ total wet weight) to assess fertilization success. A significant, positive relationship was observed between brood size and oyster density. Oysters with a nearest neighbour ≤1.5 m were found to brood significantly more larvae than individuals with nearest neighbours ≥1.5 m. Therefore, high density sites need to be maintained to ensure the recovery and enhancement of this OSPAR Convention recognized species in decline.

