Algorithms for processing closest-pairs and nearest-neighbors queries on big spatial data in parallel and distributed frameworks

Mapping Intimacies ◽

10.12681/eadd/49345 ◽

2021 ◽

Author(s):

Παναγιώτης Μουτάφης

Keyword(s):

Spatial Data ◽

Nearest Neighbor ◽

Nearest Neighbors ◽

Apache Spark ◽

K Nearest Neighbor ◽

Apache Hadoop ◽

Nearest Neighbor Query ◽

Join Queries ◽

Nearest Neighbor Queries

Τα Χωρικά Δεδομένα αναφέρονται σε δεδομένα που σχετίζονται με τη θέση ή τη γεωγραφική τοποθεσία αντικειμένων και στοιχείων υπεράνω, υπό ή επί της επιφάνειας της γης. Τέτοια δεδομένα, συχνά ονομάζονται γεωχωρικά δεδομένα, εμφανίζονται σε εφαρμογές σχετικές με τη γεωγραφία. Καθημερινά, πολυπληθείς εφαρμογές και πηγές δημιουργούν εκρηκτικούς όγκους δεδομένων με χωρικά χαρακτηριστικά ή με σχετική γεωχωρική πληροφορία. Αισθητήρες, εφαρμογές σε κινητά τηλέφωνα, αυτοκίνητα, συσκευές GPS, μη επανδρωμένα εναέρια οχήματα (UAV), πλοία, αεροπλάνα, τηλεσκόπια, ιατρικές συσκευές, διαδικτυακές εφαρμογές, κοινωνικά δίκτυα και συσκευές διαδικτύου των αντικειμένων (IoT) αποτελούν παραδείγματα τέτοιων εφαρμογών και πηγών.Η επεξεργασία των χωρικών δεδομένων είναι δυσκολότερη σε σχέση με τα δεδομένα των παραδοσιακών εφαρμογών (π.χ. ονόματα, αριθμοί, ημερομηνίες, κλπ.) και έχουν υπολογιστικές υψηλότερες απαιτήσεις. Επιπλέον, ο μεγάλος όγκος των χωρικών δεδομένων στις σύγχρονες εφαρμογές απαιτεί τη χρήση συστημάτων πολλαπλών κόμβων για την επεξεργασία τους. Μεταξύ αυτών, τα παράλληλα και κατανεμημένα συστήματα χωρίς διαμοίραση (shared-nothing) που βασίζονται στο μοντέλο MapReduce και/ή στα Ανθεκτικά Κατανεμημένα Σύνολα Δεδομένων (Resilient Distributed Datasets RDDs) απαντώνται συχνά στις ερευνητικές προσπάθειες.Η αποτελεσματική διαχείριση των μεγάλων χωρικών δεδομένων απαιτεί αποτελεσματική επεξεργασία των υπολογιστικά απαιτητικών χωρικών ερωτημάτων. Τα ακόλουθα χωρικά ερωτήματα εφαρμόζονται σε δυο σύνολα δεδομένων και συνδυάζουν ερωτήματα ζεύξης (join queries), καθώς όλοι οι δυνατοί συνδυασμοί που σχηματίζονται από αυτά τα σύνολα δεδομένων είναι υποψήφιοι για το τελικό αποτέλεσμα, και ερωτήματα εγγυτέρων γειτόνων (nearest neighbor queries), καθώς το τελικό αποτέλεσμα διαμορφώνεται σύμφωνα με ένα κριτήριο γειτονικότητας.1. Το Ερώτημα των K Εγγυτέρων Ζευγών (K Closest-Pairs Query, KCPQ): για κάθε πιθανό ζεύγος στοιχείων από τα δυο σύνολα δεδομένων, ανακαλύπτει τα K ζεύγη μετις μικρότερες αποστάσεις μεταξύ των στοιχείων τους.2. Το Ερώτημα Ζεύξης Απόστασης (Distance Join Query, DJQ): είναι ένα είδος ερωτήματος εγγυτέρων ζευγών το οποίο, για κάθε πιθανό ζεύγος στοιχείων από τα δυοσύνολα δεδομένων, επιστρέφει τα ζεύγη με αποστάσεις μικρότερες από μια δοσμένη απόσταση.3. Το Ερώτημα Όλων των K Εγγυτέρων Γειτόνων (All K Nearest Neighbor Query, AKNNQ), που ονομάζεται κσι Ζεύξη K Εγγυτέρων Γειτόνων (K NearestNeighbor Join): επιστρέφει τους K εγγύτερους γείτονες στο ένα σύνολο για κάθε στοιχείο του άλλου συνόλου.4. Το Ερώτημα Ομάδας K Εγγυτέρων Γειτόνων (Group (K) Nearest-Neighbor(s) Query, GKNNQ): επιστρέφει K στοιχεία από το ένα σύνολο με το μικρότερο άθροισμα αποστάσεων προς κάθε στοιχείο του άλλου συνόλου.Παρόλο που οι αφελείς αλγόριθμοι για τα παραπάνω ερωτήματα είναι απλοί, πάσχουν από υπερβολικό κόστος υπολογισμού, αποθήκευσης ενδιάμεσου αποτελέσματος και δικτυακής επικοινωνίας και χαμηλής εξισορρόπισης φορτίου μεταξύ των υπολογιστικών κόμβων, ιδιαίτερα σε ένα κατανεμημένο περιβάλλον. Σε αυτή τη διατριβή, επικεντρωνόμαστε σε σημειακά δεδομένα και χρησιμοποιούμε τεχνικές για γρηγορότερους και λιγότερους υπολογισμούς, περικοπή των μη αναγκαίων υπολογισμών, εκμετάλλευση της τοπικότητας και της κατανομής των δεδομένων, καλύτερης εξισορρόπησης του φορτίου μεταξύ των υπολογιστικών κόμβων και βελτιστοποίησης της ποσότητας των δεδομένων που διακινούνται μεταξύ των κόμβων. Με αυτά τα εφόδια,1. αναπτύσσουμε τους πρώτους KCPQ και DJQ αλγορίθμους για το Apache Spark, ένα δημοφιλές σύστημα παράλληλης και κατανεμημένης επεξεργασίας το οποίο έχει προσελκύσει την προσοχή εξαιτίας των δυνατοτήτων υπολογισμού εντός μνήμης,2. αναπτύσσουμε AKNNQ αλγορίθμους για το Apache Hadoop, το πρώτο ευρέως αποδεκτό σύστημα που υλοποιεί το μοντέλο MapReduce,3. αναπτύσσουμε τους πρώτους GKNNQ αλγορίθμους για το Apache Hadoop και το SpatialHadoop, μια επέκταση ειδικά σχεδιασμένη να διαχειρίζεται μεγάλα σύνολα χωρικώνδεδομένων,4. για κάθε ένα από τα παραπάνω ερωτήματα, διενεργούμε εκτεταμένα πειράματα για να εξάγουμε τις καλύτερες ρυθμίσεις των παραμέτρων για κάθε αλγόριθμο και νασυγκρίνουμε την αποτελεσματικότητα των διαφόρων εναλλακτικών αλγορίθμων που αναπτύξαμε και εκείνων της βιβλιογραφίας (για τις περιπτώσεις εκείνες όπου τέτοιοιαλγόριθμοι προϋπήρχαν).

Download Full-text

Parallel Queries of Cluster-Based k Nearest Neighbor in MapReduce

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Managing Big Data in Cloud Computing Environments ◽

10.4018/978-1-4666-9834-5.ch007 ◽

2016 ◽

pp. 163-182

Author(s):

Wei Yan

Keyword(s):

Spatial Data ◽

Spatial Databases ◽

Nearest Neighbor ◽

Programming Model ◽

K Nearest Neighbor ◽

K Nearest Neighbors ◽

Data Intensive ◽

Parallel Queries ◽

Massive Spatial Data ◽

Nearest Neighbor Queries

Parallel queries of k Nearest Neighbor for massive spatial data are an important issue. The k nearest neighbor queries (kNN queries), designed to find k nearest neighbors from a dataset S for every point in another dataset R, is a useful tool widely adopted by many applications including knowledge discovery, data mining, and spatial databases. In cloud computing environments, MapReduce programming model is a well-accepted framework for data-intensive application over clusters of computers. This chapter proposes a parallel method of kNN queries based on clusters in MapReduce programming model. Firstly, this chapter proposes a partitioning method of spatial data using Voronoi diagram. Then, this chapter clusters the data point after partition using k-means method. Furthermore, this chapter proposes an efficient algorithm for processing kNN queries based on k-means clusters using MapReduce programming model. Finally, extensive experiments evaluate the efficiency of the proposed approach.

Download Full-text

Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10110763 ◽

2021 ◽

Vol 10 (11) ◽

pp. 763

Author(s):

Panagiotis Moutafis ◽

George Mavrommatis ◽

Michael Vassilakopoulos ◽

Antonio Corral

Keyword(s):

Query Processing ◽

Nearest Neighbor ◽

Apache Spark ◽

Spatial Query ◽

K Nearest Neighbor ◽

Distributed Computing Systems ◽

Spatial Query Processing ◽

Apache Hadoop ◽

The One ◽

Query Algorithm

Aiming at the problem of spatial query processing in distributed computing systems, the design and implementation of new distributed spatial query algorithms is a current challenge. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data, without worrying about the data distribution mechanism and fault-tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves (K) points of the Training with the smallest sum of distances to every point of the Query. This spatial query has been actively studied in centralized environments and several performance improving techniques and pruning heuristics have been also proposed, while, a distributed algorithm in Apache Hadoop was recently proposed by our team. Since, in general, Apache Hadoop exhibits lower performance than Spark, in this paper, we present the first distributed GKNN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. This algorithm incorporates programming features and facilities that are specific to Apache Spark. Moreover, techniques that improve performance and are applicable in Apache Spark are also incorporated. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and a clear winner in comparison to processing this query in Apache Hadoop.

Download Full-text

Tropical Balls and Its Applications to K Nearest Neighbor over the Space of Phylogenetic Trees

Mathematics ◽

10.3390/math9070779 ◽

2021 ◽

Vol 9 (7) ◽

pp. 779

Author(s):

Ruriko Yoshida

Keyword(s):

Supervised Learning ◽

Phylogenetic Trees ◽

Nearest Neighbor ◽

Nearest Neighbors ◽

High Dimensional ◽

Learning Method ◽

Dimensional Vector ◽

K Nearest Neighbor ◽

K Nearest Neighbors

A tropical ball is a ball defined by the tropical metric over the tropical projective torus. In this paper we show several properties of tropical balls over the tropical projective torus and also over the space of phylogenetic trees with a given set of leaf labels. Then we discuss its application to the K nearest neighbors (KNN) algorithm, a supervised learning method used to classify a high-dimensional vector into given categories by looking at a ball centered at the vector, which contains K vectors in the space.

Download Full-text

Toward Oblivious Location-Based k-Nearest Neighbor Query in Smart Cities

IEEE Internet of Things Journal ◽

10.1109/jiot.2021.3068859 ◽

2021 ◽

pp. 1-1

Author(s):

Yunguo Guan ◽

Rongxing Lu ◽

Yandong Zheng ◽

Jun Shao ◽

Guiyi Wei

Keyword(s):

Nearest Neighbor ◽

Smart Cities ◽

K Nearest Neighbor ◽

Nearest Neighbor Query

Download Full-text

A Partitioning GPU-based Algorithm for Processing the k Nearest-Neighbor Query

Proceedings of the 12th International Conference on Management of Digital EcoSystems ◽

10.1145/3415958.3433071 ◽

2020 ◽

Author(s):

Polychronis Velentzas ◽

Michael Vassilakopoulos ◽

Antonio Corral

Keyword(s):

Nearest Neighbor ◽

K Nearest Neighbor ◽

Nearest Neighbor Query

Download Full-text

GPU-aided Edge Computing for Processing the k Nearest-Neighbor Query on SSD-resident Data

Internet of Things ◽

10.1016/j.iot.2021.100428 ◽

2021 ◽

pp. 100428

Author(s):

Polychronis Velentzas ◽

Michael Vassilakopoulos ◽

Antonio Corral

Keyword(s):

Nearest Neighbor ◽

Edge Computing ◽

K Nearest Neighbor ◽

Nearest Neighbor Query

Download Full-text

An Efficient Incremental Nearest Neighbor Algorithm for Processing k-Nearest Neighbor Queries with Visal and Semantic Predicates in Multimedia Information Retrieval System

Information Retrieval Technology - Lecture Notes in Computer Science ◽

10.1007/11562382_63 ◽

2005 ◽

pp. 653-658 ◽

Cited By ~ 1

Author(s):

Dong-Ho Lee ◽

Dong-Joo Park

Keyword(s):

Information Retrieval ◽

Multimedia Information ◽

Retrieval System ◽

Nearest Neighbor ◽

Information Retrieval System ◽

Multimedia Information Retrieval ◽

K Nearest Neighbor ◽

Nearest Neighbor Algorithm ◽

Nearest Neighbor Queries

Download Full-text

Continuous K Nearest Neighbor Query Scheme with Privacy and Security Guarantees in Road Networks

2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI) ◽

10.1109/smartworld.2018.00153 ◽

2018 ◽

Author(s):

Changli Zhou ◽

Tian Wang ◽

Hui Tian ◽

Wenxian Jiang

Keyword(s):

Nearest Neighbor ◽

Road Networks ◽

K Nearest Neighbor ◽

Privacy And Security ◽

Nearest Neighbor Query ◽

Security Guarantees

Download Full-text

The K Nearest Neighbor Algorithm for Imputation of Missing Longitudinal Prenatal Alcohol Data

10.21203/rs.3.rs-32456/v2 ◽

2021 ◽

Author(s):

Ayesha Sania ◽

Nicolo Pini ◽

Morgan Nelson ◽

Michael Myers ◽

Lauren Shuffrey ◽

...

Keyword(s):

Missing Data ◽

Missing Values ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Drinking Behavior ◽

Nearest Neighbors ◽

First Trimester ◽

Epidemiologic Studies ◽

K Nearest Neighbor ◽

Timeline Followback

Abstract Background — Missing data are a source of bias in epidemiologic studies. This is problematic in alcohol research where data missingness is linked to drinking behavior. Methods — The Safe Passage study was a prospective investigation of prenatal drinking and fetal/infant outcomes (n=11,083). Daily alcohol consumption for last reported drinking day and 30 days prior was recorded using Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing data using a machine learning algorithm; “K Nearest Neighbor” (K-NN). K-NN imputes missing values for a participant using data of participants closest to it. Imputed values were weighted for the distances from nearest neighbors and matched for day of week. Validation was done on randomly deleted data for 5-15 consecutive days. Results — Data from 5 nearest neighbors and segments of 55 days provided imputed values with least imputation error. After deleting data segments from with no missing days first trimester, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual. Conclusions — K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.

Download Full-text