Accelerating k-nearest-neighbor searches

2016 ◽  
Vol 49 (5) ◽  
pp. 1471-1477 ◽  
Author(s):  
Herbert J. Bernstein ◽  
Lawrence C. Andrews

The search for which k points are closest to a given probe point in a space of N known points, the 'k-nearest-neighbor' or 'KNN' problem, is a computationally challenging problem of importance in many disciplines, such as the design of numerical databases, analysis of multi-dimensional experimental data sets, multi-particle simulations and data mining. A standard approach is to preprocess the data into a tree and make use of the triangle inequality to prune the search, bringing the time to find a single nearest point in a well balanced tree down to the order of the logarithm of N. All known approaches suffer from the 'curse of dimensionality', which causes the search to explore many more branches of the tree than one might wish as the dimensionality of the problem increases, driving search times closer to the order of N. Looking for k nearest points can sometimes be done in approximately the time needed to search for one nearest point, but more often it requires k searches because the results are distributed widely. The result is very long search times, especially when the search radius and k are large and individual distance calculations are expensive, because the same probe-to-data-point distance calculations are executed repeatedly as the top of the tree is re-explored. Combining two acceleration techniques was found to improve the search time dramatically: (i) organizing the search into nested searches in non-overlapping annuli of increasing radii, using an estimate of the Hausdorff dimension applicable to this data instance, obtained from the results of earlier annuli, to help set the radius of the next annulus; and (ii) caching all distance calculations involving the probe point to reduce the cost of repeated use of the same distances. In a search of the combined macromolecular and small-molecule data in a combined six-dimensional database of nearly 900 000 entries, this acceleration improved the overall time of the searches by one to two orders of magnitude.
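As an illustration of the second technique, here is a minimal Python sketch (our own, not the authors' code) of caching probe-to-data-point distances so that re-exploring the top of the tree never repeats an expensive metric evaluation; the function names and the use of the Euclidean metric are assumptions for the example.

```python
import math

def make_cached_distance(probe):
    """Return a distance-to-probe function that memoizes by point index."""
    cache = {}

    def dist(index, point):
        # Each probe-to-data-point distance is computed at most once,
        # no matter how often tree re-exploration asks for it again.
        if index not in cache:
            cache[index] = math.dist(probe, point)
        return cache[index]

    return dist

# Usage: d = make_cached_distance((0.0, 0.0)); d(7, (3.0, 4.0)) returns 5.0,
# and any later call d(7, ...) is a dictionary lookup, not a recomputation.
```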

2019 ◽  
Vol 28 (06) ◽  
pp. 1960002 ◽  
Author(s):  
Brankica Bratić ◽  
Michael E. Houle ◽  
Vladimir Kurbalija ◽  
Vincent Oria ◽  
Miloš Radovanović

The K-nearest neighbor graph (K-NNG) is a data structure used by many machine-learning algorithms. Naive computation of the K-NNG has quadratic time complexity, which in many cases is not efficient enough, producing the need for fast and accurate approximation algorithms. NN-Descent is one such algorithm that is highly efficient, but it has a major drawback: its K-NNG approximations are accurate only on data of low intrinsic dimensionality. This paper presents an experimental analysis of this behavior and investigates possible solutions. Experimental results show that there is a link between the performance of NN-Descent and the phenomenon of hubness, defined as the tendency of intrinsically high-dimensional data to contain hubs – points with large in-degrees in the K-NNG. First, we explain how the presence of the hubness phenomenon causes poor NN-Descent performance. In light of that, we propose four NN-Descent variants to alleviate the observed negative influence of hubs. Evaluating the proposed approaches on several real and synthetic data sets, we conclude that they are more accurate, but often at the cost of higher scan rates.
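The hubness phenomenon the paper links to NN-Descent's failures can be observed directly. The following sketch (a generic illustration, not the paper's code) builds the exact K-NNG by brute force and reports the in-degree distribution; on intrinsically high-dimensional data a few points acquire in-degrees far above K.

```python
import numpy as np

def in_degrees(X, k):
    """In-degree of every point in the exact k-NN graph over the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-neighbors
    knn = np.argsort(d2, axis=1)[:, :k]                  # k nearest per point
    return np.bincount(knn.ravel(), minlength=len(X))    # count incoming edges

rng = np.random.default_rng(0)
deg = in_degrees(rng.standard_normal((500, 50)), k=10)   # 50-dimensional data
print(deg.max(), deg.mean())  # hubs show up as maxima far above the mean of k
```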


2001 ◽  
Vol 221 (5-6) ◽  
Author(s):  
Elizabeth Kremp ◽  
Elmar Stöß

Summary. This paper investigates the borrowing behavior of 2,900 French and 1,300 German firms over the 1987-95 period. Both samples, based on data sets of the Banque de France and the Deutsche Bundesbank, include not only large but also small and medium-sized enterprises. Applying GMM techniques, we estimate identical debt equations for the two total samples and by size class. Despite the large differences between the two countries in terms of debt trends over time and across size classes, the main result is the similarity of several determinants between France and Germany. For example, we find that firm growth has a positive impact on borrowing, in line with the theory of signalling, whereas the negative correlation of profit and debt supports the pecking-order approach; the cost of finance likewise has a negative impact on leverage. Additionally, the study provides some insights into the monetary transmission mechanism in both EMU member countries.


Polymers ◽  
2021 ◽  
Vol 13 (21) ◽  
pp. 3811
Author(s):  
Iosif Sorin Fazakas-Anca ◽  
Arina Modrea ◽  
Sorin Vlase

This paper proposes a new method for calculating the monomer reactivity ratios for binary copolymerization based on the terminal model. The original optimization method combines a numerical integration algorithm with an optimization algorithm based on k-nearest-neighbour non-parametric regression. The calculation method has been tested on simulated and experimental data sets at low (<10%), medium (10–35%) and high conversions (>40%), yielding reactivity ratios in good agreement with the usual methods such as intersection, Fineman–Ross, reverse Fineman–Ross, Kelen–Tüdös, extended Kelen–Tüdös and the error-in-variables method. The experimental data sets used in this comparative analysis are the copolymerization of 2-(N-phthalimido)ethyl acrylate with 1-vinyl-2-pyrrolidone for low conversion, of isoprene with glycidyl methacrylate for medium conversion and of N-isopropylacrylamide with N,N-dimethylacrylamide for high conversion. The possibility of estimating experimental errors from a single experimental data set formed by n experimental points is also shown.
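The regression step can be pictured with a small sketch (an assumed illustration of k-nearest-neighbour non-parametric regression in general, not the paper's implementation): the objective at a trial point is estimated as the average of the values already computed at its k nearest evaluated points.

```python
import numpy as np

def knn_regress(x, X_seen, y_seen, k=3):
    """Estimate y at x as the mean over the k nearest evaluated points."""
    d = np.linalg.norm(X_seen - x, axis=1)  # distances to past evaluations
    nearest = np.argsort(d)[:k]             # indices of the k closest points
    return y_seen[nearest].mean()           # non-parametric local estimate
```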


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e2264 ◽  
Author(s):  
David Solomon ◽  
Bo-Christer Björk

Background. Open access (OA) publishing via article processing charges (APCs) is growing as an alternative to subscription publishing. The Pay It Forward (PIF) Project is exploring the feasibility of transitioning from paying subscriptions to funding APCs for faculty at research-intensive universities. Estimating the cost of APCs for the journals in which authors at research-intensive universities tend to publish is essential for the PIF project and similar initiatives. This paper presents our research into this question.

Methods. We identified APC prices for publications by authors at the 4 research-intensive United States (US) and Canadian universities involved in the study. We also obtained APC payment records from several Western European universities and funding agencies. Both data sets were merged with Web of Science (WoS) metadata. We calculated the average APCs for articles and proceedings in 13 discipline categories published by researchers at research-intensive universities. We also identified 41 journals published by traditionally subscription publishers which have recently converted to APC-funded OA and recorded the APCs they charge.

Results. We identified 7,629 payment records from the 4 European APC payment databases and 14,356 OA articles authored by PIF partner university faculty for which we had listed APC prices. APCs for full OA journals published by PIF authors averaged 1,775 USD; full OA journal APCs paid by Western European funders averaged 1,865 USD; hybrid APCs paid by Western European funders averaged 2,887 USD. The APC for converted journals published by major subscription publishers averaged 1,825 USD. APC-funded OA is concentrated in the life and basic sciences. APC-funded articles in the social sciences and humanities are often multidisciplinary and published in journals such as PLOS ONE that largely publish in the life sciences.

Conclusions. Full OA journal APCs average a little under 2,000 USD, while hybrid-article APCs average about 3,000 USD, for publications by researchers at research-intensive universities. There is a lack of information on discipline differences in APCs due to the concentration of APC-funded publications in a few fields and the multidisciplinary nature of research.


Author(s):  
Miguel A. Sánchez-Acevedo ◽  
Zaydi Anaí Acosta-Chi ◽  
Ma. del Rocío Morales-Salgado

Cardiovascular diseases are the main cause of mortality in the world. As more people suffer from diabetes and hypertension, the risk of cardiovascular disease (CVD) increases. A sedentary lifestyle, an unhealthy diet, and stressful activities are behaviors that can be changed to prevent CVD. Taking measures to prevent CVD lowers the cost of treatments and reduces mortality. Data-driven plans generate more effective results and can be applied to groups with similar characteristics. Currently, there are several databases that can be used to extract information in real time and improve decision making. This article proposes a methodology for the detection of CVD and a web tool to analyze the data more effectively. The methodology for extracting, describing, and visualizing data from a state-level case study of CVD in Mexico is presented. The data is obtained from the databases of the National Institute of Statistics and Geography (INEGI) and the National Survey of Health and Nutrition (ENSANUT). A k-nearest neighbor (KNN) algorithm is proposed to predict missing data.
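As a sketch of the missing-data step (our illustration; the article's exact KNN variant is not reproduced here), a record's missing value can be filled with the average of that feature over the k complete records closest in the observed features:

```python
import numpy as np

def knn_impute(row, complete, k=5):
    """Fill NaN entries of `row` using the k nearest rows of `complete`."""
    observed = ~np.isnan(row)                   # features present in the row
    d = np.linalg.norm(complete[:, observed] - row[observed], axis=1)
    neighbors = complete[np.argsort(d)[:k]]     # k closest complete records
    filled = row.copy()
    filled[~observed] = neighbors[:, ~observed].mean(axis=0)
    return filled
```

scikit-learn packages the same idea as sklearn.impute.KNNImputer, which would be a natural off-the-shelf choice for this kind of survey data.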


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Xuelian Yang ◽  
Jin Bai ◽  
Xiaolin Wang

With the development of Internet technology and social models, games have become an important form of entertainment in people's lives, and the precise marketing of game products has become a winning means for enterprises to improve competitiveness and reduce labor costs; major game companies are paying more and more attention to data-based marketing models. Digging effective information out of existing market-behavior data is a powerful means of implementing precise marketing, and achieving precise positioning and marketing in the gaming market underpins the innovative development of game companies. To address this problem, and building on the SEMAS data-mining process, this paper proposes a mining model based on a recurrent neural network, named Dynamic Attention GRU (DAGRU), which incorporates multiple dynamic attention mechanisms, and evaluates it on two self-built data sets of user-behavior samples. The results demonstrate that the mining method can effectively analyze and predict player behavior goals. A game marketing system based on data mining can indeed provide more accurate and automated marketing services, greatly reducing the cost investment of the traditional marketing model, achieving accurately targeted marketing services and offering real application value.
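The DAGRU architecture is not specified in the abstract; the following PyTorch sketch is only a generic guess at the family it might belong to: a GRU encoder over user-behavior sequences with an additive attention pooling, where the class AttnGRU and all hyperparameters are our assumptions.

```python
import torch
import torch.nn as nn

class AttnGRU(nn.Module):
    """GRU over behavior steps with attention pooling (illustrative only)."""
    def __init__(self, n_features, hidden=64, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)        # scores each time step
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                       # x: (batch, time, features)
        h, _ = self.gru(x)                      # hidden state per time step
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        context = (w * h).sum(dim=1)            # weighted summary of the sequence
        return self.head(context)               # logits for behavior classes
```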


2021 ◽  
Vol 11 (20) ◽  
pp. 9581
Author(s):  
Wei Wang ◽  
Yi Zhang ◽  
Genyu Ge ◽  
Qin Jiang ◽  
Yang Wang ◽  
...  

The spatial index structure is one of the most important research topics for organizing and managing massive 3D point clouds. Since a point in a point cloud consists of Cartesian coordinates (x, y, z), the common way to explore geometric information and features is nearest-neighbor searching, and an efficient spatial indexing structure directly affects its speed. The octree and the k-d tree are the structures most used for point-cloud data, yet neither performs best in nearest-neighbor searching; the 3D R*-tree, a highly balanced tree, is considered the most effective method so far. We therefore propose a hybrid spatial indexing structure based on the octree and the 3D R*-tree. In this paper, we discuss how thresholds influence the performance of nearest-neighbor searching and of constructing the tree, and finally adopt an adaptive method to set the thresholds. The resulting structure achieves better performance in both tree construction and nearest-neighbor searching than either the octree or the 3D R*-tree.
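A condensed sketch of the threshold idea (our illustration; the paper's adaptive rule is not reproduced): an octree cell splits only while it holds more points than a capacity threshold, and cells at or below it remain leaves, which in a hybrid scheme could each be handed to a small 3D R*-tree. The class and parameter names are assumptions.

```python
class OctreeNode:
    """Octree cell that splits once its point count exceeds `capacity`."""
    def __init__(self, center, half, capacity=32):
        self.center, self.half, self.capacity = center, half, capacity
        self.points, self.children = [], None

    def insert(self, p):
        if self.children is None:
            self.points.append(p)
            if len(self.points) > self.capacity:  # threshold exceeded: subdivide
                self._split()
        else:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        # Octant index: bit a is set when p lies above the center on axis a.
        i = ((p[0] > self.center[0])
             | ((p[1] > self.center[1]) << 1)
             | ((p[2] > self.center[2]) << 2))
        return self.children[i]

    def _split(self):
        h = self.half / 2
        self.children = [
            OctreeNode(tuple(c + (h if (i >> a) & 1 else -h)
                             for a, c in enumerate(self.center)), h, self.capacity)
            for i in range(8)]
        for q in self.points:                     # push stored points down a level
            self._child_for(q).insert(q)
        self.points = []
```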


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix of size N×N, where N is the number of data points, so the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land-use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained from the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks, and a manifold skeleton is then identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
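The local-curvature landmark sampler is not spelled out in the abstract; as a stand-in, the sketch below uses farthest-point sampling, a common generic way to pick m landmarks so that the eigen analysis runs on an m×m matrix instead of N×N. The function name and the choice of sampler are assumptions.

```python
import numpy as np

def farthest_point_landmarks(X, m, seed=0):
    """Greedily pick m rows of X that are mutually far apart (landmarks)."""
    rng = np.random.default_rng(seed)
    picked = [int(rng.integers(len(X)))]          # random first landmark
    d = np.linalg.norm(X - X[picked[0]], axis=1)  # distance to landmark set
    for _ in range(m - 1):
        nxt = int(d.argmax())                     # farthest from current set
        picked.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(picked)
```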

