join queries
Recently Published Documents

Optimization of correlate subquery based on distributed database

Subqueries are widely used in databases. They can be divided into correlated and non-correlated subqueries according to whether they depend on a table of the parent query. For a correlated subquery, a tuple must be taken from the parent query before the subquery is executed, so the subquery is evaluated repeatedly. The disk access cost of this strategy is high, and in a distributed database the data communication overhead makes processing the parent query's tuple set this way too inefficient. Therefore, for this class of subqueries, building on existing query optimization strategies and taking the characteristics of distributed databases into account, the paper proposes a distributed query optimization strategy based on converting subqueries into join queries, eliminating redundant clauses in the subquery, and eliminating aggregate functions. The effectiveness of the strategy is verified by experiment.
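The subquery-to-join conversion mentioned in the abstract can be illustrated with a generic decorrelation. This is a minimal sketch, not the paper's strategy: the `emp` table, its columns, and the sample rows are hypothetical, and SQLite stands in for a distributed database only to show that both forms return the same rows.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp(id INTEGER PRIMARY KEY, dept_id INTEGER, salary INTEGER);
    INSERT INTO emp VALUES (1, 1, 100), (2, 1, 300), (3, 2, 200), (4, 2, 200);
""")

# Correlated form: the inner query depends on e.dept_id, so it is
# conceptually re-evaluated for every tuple of the outer query.
correlated = """
    SELECT e.id FROM emp e
    WHERE e.salary > (SELECT AVG(e2.salary) FROM emp e2
                      WHERE e2.dept_id = e.dept_id)
    ORDER BY e.id
"""

# Join form: the average is computed once per department, then joined.
as_join = """
    SELECT e.id FROM emp e
    JOIN (SELECT dept_id, AVG(salary) AS avg_salary
          FROM emp GROUP BY dept_id) a
      ON a.dept_id = e.dept_id
    WHERE e.salary > a.avg_salary
    ORDER BY e.id
"""

rows_correlated = conn.execute(correlated).fetchall()
rows_join = conn.execute(as_join).fetchall()
print(rows_correlated, rows_join)
```

In a distributed setting the join form matters because the per-department aggregate can be computed where the data lives and shipped once, rather than triggering one remote subquery evaluation per outer tuple.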


Beyond equi-joins

Proceedings of the VLDB Endowment ◽ 10.14778/3476249.3476306 ◽ 2021 ◽ Vol 14 (11) ◽ pp. 2599-2612

Author(s): Nikolaos Tziavelis ◽ Wolfgang Gatterbauer ◽ Mirek Riedewald

Keyword(s): Experimental Study ◽ State Of The Art ◽ Database Systems ◽ Ranking Function ◽ Space Complexity ◽ Time And Space ◽ Running Time ◽ Join Queries ◽ Time And Space Complexity ◽ Memory Efficient

We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with n denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of k, the k top-ranked answers are returned in O(n polylog n + k log k) time. This is within a polylogarithmic factor of O(n + k log k), i.e., the best known complexity for equi-joins, and even of O(n + k), i.e., the time it takes to look at the input and return k answers in any order. Our guarantees extend to join queries with selections and many types of projections (namely those called "free-connex" queries and those that use bag semantics). Remarkably, they hold even when the number of join results is n^ℓ for a join of ℓ relations. The key ingredient is a novel O(n polylog n)-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. In addition to providing the first nontrivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude.
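To make the problem statement concrete, the sketch below computes the k top-ranked answers of a two-relation inequality join the naive way. It is a baseline for illustration only and not the paper's algorithm: the relation layout `(value, weight)`, the predicate `r.a < s.b`, and ranking by sum of weights are all assumptions chosen for the example.

```python
import heapq
from itertools import product

def naive_ranked_theta_join(r, s, k):
    """Return the k lowest-weight answers (weight, r_tuple, s_tuple)
    of the inequality join r.a < s.b, ranked by the sum of weights.

    Naive baseline: every candidate pair is inspected, so the cost grows
    with |r| * |s| regardless of how small k is.
    """
    matches = (
        (rw + sw, (ra, rw), (sb, sw))
        for (ra, rw), (sb, sw) in product(r, s)
        if ra < sb  # the theta (inequality) join predicate
    )
    return heapq.nsmallest(k, matches)

# Hypothetical input relations: tuples are (join attribute, weight).
R = [(1, 5), (4, 1), (7, 2)]
S = [(3, 2), (6, 4), (8, 1)]
top3 = naive_ranked_theta_join(R, S, 3)
print(top3)
```

The quadratic scan of candidate pairs is exactly what the abstract's O(n polylog n)-size factorized representation is meant to avoid.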


Embedded Functional Dependencies and Data-completeness Tailored Database Design

ACM Transactions on Database Systems ◽ 10.1145/3450518 ◽ 2021 ◽ Vol 46 (2) ◽ pp. 1-46

Author(s): Ziheng Wei ◽ Sebastian Link

Keyword(s): Missing Values ◽ Normal Forms ◽ Functional Dependencies ◽ Redundant Data ◽ Processing Data ◽ Data Value ◽ Schema Design ◽ Join Queries ◽ Application Data ◽ Fit For Purpose

We establish a principled schema design framework for data with missing values. The framework is based on the new notion of an embedded functional dependency, which is independent of the interpretation of missing values, able to express completeness and integrity requirements on application data, and capable of capturing redundant data value occurrences that may cause problems with processing data that meets the requirements. We establish axiomatic, algorithmic, and logical foundations for reasoning about embedded functional dependencies. These foundations enable us to introduce generalizations of Boyce-Codd and Third normal forms that avoid processing difficulties of any application data, or minimize these difficulties across dependency-preserving decompositions, respectively. We show how to transform any given schema into application schemata that meet given completeness and integrity requirements, and the conditions of the generalized normal forms. Data over those application schemata are therefore fit for purpose by design. Extensive experiments with benchmark schemata and data illustrate the effectiveness of our framework for the acquisition of the constraints, the schema design process, and the performance of the schema designs in terms of updates and join queries.


Algorithms for processing closest-pairs and nearest-neighbors queries on big spatial data in parallel and distributed frameworks

10.12681/eadd/49345 ◽ 2021

Author(s): Παναγιώτης Μουτάφης

Keyword(s): Spatial Data ◽ Nearest Neighbor ◽ Nearest Neighbors ◽ Apache Spark ◽ K Nearest Neighbor ◽ Apache Hadoop ◽ Nearest Neighbor Query ◽ Join Queries ◽ Nearest Neighbor Queries

Spatial data refers to data related to the position or geographic location of objects and elements above, below, or on the surface of the earth. Such data, often called geospatial data, appear in geography-related applications. Every day, numerous applications and sources generate explosive volumes of data with spatial characteristics or associated geospatial information. Sensors, mobile phone applications, cars, GPS devices, unmanned aerial vehicles (UAVs), ships, airplanes, telescopes, medical devices, web applications, social networks, and Internet of Things (IoT) devices are examples of such applications and sources.

Processing spatial data is harder than processing the data of traditional applications (e.g., names, numbers, dates) and has higher computational requirements. Moreover, the large volume of spatial data in modern applications requires multi-node systems for its processing. Among these, parallel and distributed shared-nothing systems based on the MapReduce model and/or Resilient Distributed Datasets (RDDs) are frequently encountered in research efforts.

Effective management of big spatial data requires efficient processing of computationally demanding spatial queries. The following spatial queries are applied to two datasets and combine join queries, since all possible combinations formed from these datasets are candidates for the final result, and nearest neighbor queries, since the final result is shaped according to a proximity criterion.

1. The K Closest-Pairs Query (KCPQ): among all possible pairs of elements from the two datasets, it finds the K pairs with the smallest distances between their elements.
2. The Distance Join Query (DJQ): a kind of closest-pairs query that, among all possible pairs of elements from the two datasets, returns the pairs with distances smaller than a given distance.
3. The All K Nearest Neighbor Query (AKNNQ), also called K Nearest Neighbor Join: it returns the K nearest neighbors in one set for each element of the other set.
4. The Group (K) Nearest-Neighbor(s) Query (GKNNQ): it returns K elements from one set with the smallest sum of distances to every element of the other set.

Although the naive algorithms for the above queries are simple, they suffer from excessive costs of computation, intermediate result storage, and network communication, and from poor load balancing among the computing nodes, especially in a distributed environment. In this thesis, we focus on point data and use techniques for faster and fewer computations, pruning of unnecessary computations, exploitation of data locality and distribution, better load balancing among the computing nodes, and optimization of the amount of data transferred between nodes. With these tools, we:

1. develop the first KCPQ and DJQ algorithms for Apache Spark, a popular parallel and distributed processing system that has attracted attention because of its in-memory computation capabilities;
2. develop AKNNQ algorithms for Apache Hadoop, the first widely accepted system that implements the MapReduce model;
3. develop the first GKNNQ algorithms for Apache Hadoop and SpatialHadoop, an extension specifically designed to handle large spatial datasets;
4. for each of the above queries, conduct extensive experiments to derive the best parameter settings for each algorithm and to compare the efficiency of the various alternative algorithms we developed with those of the literature (for the cases where such algorithms already existed).
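The four query types described in this abstract can be pinned down with single-machine brute-force definitions. This is a sketch of the naive baselines only, assuming 2D points as tuples and Euclidean distance; the function names and sample points are hypothetical, and none of the thesis's Spark or Hadoop algorithms are reproduced here.

```python
import heapq
from math import dist  # Euclidean distance, Python 3.8+

def kcpq(p, q, k):
    """K Closest-Pairs Query: the k pairs (d, a, b) with the smallest distance."""
    return heapq.nsmallest(k, ((dist(a, b), a, b) for a in p for b in q))

def djq(p, q, eps):
    """Distance Join Query: all pairs (a, b) closer than eps."""
    return [(a, b) for a in p for b in q if dist(a, b) < eps]

def aknnq(p, q, k):
    """All K Nearest Neighbor Query: for each a in p, its k nearest points in q."""
    return {a: heapq.nsmallest(k, q, key=lambda b: dist(a, b)) for a in p}

def gknnq(p, q, k):
    """Group K Nearest-Neighbor Query: the k points of p with the smallest
    sum of distances to all points of q."""
    return heapq.nsmallest(k, p, key=lambda a: sum(dist(a, b) for b in q))

# Hypothetical sample datasets.
P = [(0, 0), (5, 5), (10, 0)]
Q = [(1, 0), (6, 5), (9, 9)]

closest = kcpq(P, Q, 2)
within = djq(P, Q, 6)
neighbors = aknnq(P, Q, 1)
group = gknnq(P, Q, 1)
print(closest, within, neighbors, group)
```

Each function inspects all |P| x |Q| combinations, which is the computation and communication cost the abstract says becomes excessive in a distributed environment.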


Parameterized Counting of Partially Injective Homomorphisms

Algorithmica ◽ 10.1007/s00453-021-00805-y ◽ 2021

Author(s): Marc Roth

Keyword(s): Parameterized Complexity ◽ Linear Combinations ◽ Counting Problem ◽ Fixed Parameter Tractable ◽ Large Target ◽ Graph Homomorphisms ◽ Join Queries ◽ Fixed Parameter ◽ Complexity Classification ◽ Injective Homomorphisms

We study the parameterized complexity of the problem of counting graph homomorphisms with given partial injectivity constraints, i.e., inequalities between pairs of vertices, which subsumes counting of graph homomorphisms, subgraph counting and, more generally, counting of answers to equi-join queries with inequalities. Our main result presents an exhaustive complexity classification for the problem in fixed-parameter tractable and #W[1]-complete cases. The proof relies on the framework of linear combinations of homomorphisms as independently discovered by Chen and Mengel (PODS 16) and by Curticapean, Dell and Marx in the recent breakthrough result regarding the exact complexity of the subgraph counting problem (STOC 17). Moreover, we invoke Rota’s NBC-Theorem to obtain an explicit criterion for fixed-parameter tractability based on treewidth. The abstract classification theorem is then applied to the problem of counting locally injective graph homomorphisms from small pattern graphs to large target graphs. As a consequence, we are able to fully classify its parameterized complexity depending on the class of allowed pattern graphs.
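The counting problem in this abstract can be stated as a brute-force procedure. The sketch below assumes undirected graphs given as edge lists over vertices `0..n-1`; the function name and the example graphs are hypothetical, and the exhaustive enumeration only defines the quantity being counted, not an efficient algorithm.

```python
from itertools import product

def count_partially_injective_homs(pattern_edges, n_pattern,
                                   target_edges, n_target, inequalities):
    """Count maps phi from pattern vertices to target vertices such that
    every pattern edge lands on a target edge and phi[u] != phi[v] for
    every pair (u, v) in `inequalities`."""
    # Store both orientations so the target is treated as undirected.
    target = set(target_edges) | {(v, u) for u, v in target_edges}
    count = 0
    for phi in product(range(n_target), repeat=n_pattern):
        edges_ok = all((phi[u], phi[v]) in target for u, v in pattern_edges)
        distinct_ok = all(phi[u] != phi[v] for u, v in inequalities)
        if edges_ok and distinct_ok:
            count += 1
    return count

triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]  # pattern: path on three vertices

# No inequalities: plain homomorphism counting.
all_homs = count_partially_injective_homs(path, 3, triangle, 3, [])
# Require the two endpoints of the path to map to different vertices.
distinct_ends = count_partially_injective_homs(path, 3, triangle, 3, [(0, 2)])
print(all_homs, distinct_ends)
```

With an empty inequality list this is ordinary homomorphism counting, and with an inequality for every pair of pattern vertices it counts injective maps, which is how the problem subsumes both cases named in the abstract.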


From natural language processing to neural databases

Proceedings of the VLDB Endowment ◽ 10.14778/3447689.3447706 ◽ 2021 ◽ Vol 14 (6) ◽ pp. 1033-1039

Author(s): James Thorne ◽ Majid Yazdani ◽ Marzieh Saeidi ◽ Fabrizio Silvestri ◽ Sebastian Riedel ◽ ...

Keyword(s): Natural Language Processing ◽ Natural Language ◽ Language Processing ◽ Neural Nets ◽ Join Queries ◽ The Rich ◽ New Research ◽ Aggregation Queries ◽ Performance Gains ◽ Text Images

In recent years, neural networks have shown impressive performance gains on long-standing AI problems, such as answering queries from text and machine translation. These advances raise the question of whether neural nets can be used at the core of query processing to derive answers from facts, even when the facts are expressed in natural language. If so, it is conceivable that we could relax the fundamental assumption of database management, namely, that our data is represented as fields of a pre-defined schema. Furthermore, such technology would enable combining information from text, images, and structured data seamlessly. This paper introduces neural databases, a class of systems that use NLP transformers as localized answer derivation engines. We ground the vision in NeuralDB, a system for querying facts represented as short natural language sentences. We demonstrate that recent natural language processing models, specifically transformers, can answer select-project-join queries if they are given a set of relevant facts. However, they cannot scale to non-trivial databases nor answer set-based and aggregation queries. Based on these insights, we identify specific research challenges that are needed to build neural databases. Some of the challenges require drawing upon the rich literature in data management, and others pose new research opportunities to the NLP community. Finally, we show that with preliminary solutions, NeuralDB can already answer queries over thousands of sentences with very high accuracy.


Exploiting Sharing Join Opportunities in Big Data Multiquery Optimization with Flink

Complexity ◽ 10.1155/2020/6617149 ◽ 2020 ◽ Vol 2020 ◽ pp. 1-25

Author(s): Xiao-Yan Gao ◽ Radhya Sahal ◽ Gui-Xiu Chen ◽ Mohammed H. Khafagy ◽ Fatma A. Omara

Keyword(s): Big Data ◽ Execution Time ◽ Large Scale ◽ Query Execution ◽ Multiple Queries ◽ Intermediate Data ◽ Large Scale Data ◽ Join Queries ◽ Multiquery Optimization ◽ Data Granularity

Multiway join queries incur high-cost I/O operations over large-scale data. Exploiting sharing opportunities among multiple multiway joins can reduce both query execution time and the amount of shuffled intermediate data. Although multiway join optimization has been carried out in MapReduce, platforms with different design principles (i.e., in-memory Big Data platforms such as Flink) have not been considered. To bridge this gap, an end-to-end multiway join system over Flink, called Join-MOTH (J-MOTH), is proposed to exploit sharing data granularity, sharing join granularity, and sharing implicit sorts within multiple join queries. For sharing data, our previous work, the Multiquery Optimization using Tuple Size and Histogram (MOTH) system, is used to consider the granularity of data sharing opportunities among multiple queries. For sharing sorts, our previous work, the Sort-Based Optimizer for Big Data Multiquery (SOOM), is used to consider the implicit sorts among join queries. For sharing joins, additional modules have been tailored to the J-MOTH optimizer to optimize shared work by exploiting shared pipelined multiway joins among multiple multiway join queries. The experimental evaluation demonstrates that the J-MOTH system outperforms the naive and the state-of-the-art techniques by 44% in query execution time using TPC-H queries. The proposed J-MOTH system also achieves a maximal intermediate data size reduction of 30% on average over Hadoop-like infrastructures.


Private Set Operations Over Encrypted Cloud Dataset and Applications

The Computer Journal ◽ 10.1093/comjnl/bxaa123 ◽ 2020

Author(s): Mojtaba Rafiee ◽ Shahram Khazaei

Keyword(s): Cloud Service ◽ Santa Barbara ◽ Cloud Service Provider ◽ Set Operations ◽ Searchable Symmetric Encryption ◽ Join Queries ◽ Boolean Search ◽ Boolean Queries ◽ And Storage ◽ Security Notion

We introduce the notion of private set operations (PSO) as a symmetric-key primitive in the cloud scenario, where a client securely outsources his dataset to a cloud service provider and later privately issues queries in the form of common set operations. We define a syntax and security notion for PSO and propose a general construction that satisfies it. There are two main ingredients to our PSO scheme: an adjustable join (Adjoin) scheme (MIT-CSAIL-TR-2012-006 (2012) Cryptographic treatment of CryptDB’s adjustable join. http://people.csail.mit.edu/nickolai/papers/popa-join-tr.pdf) and a tuple set (TSet) scheme (Cash, D., Jarecki, S., Jutla, C. S., Krawczyk, H., Rosu, M.-C., and Steiner, M. (2013) Highly-Scalable Searchable Symmetric Encryption With Support for Boolean Queries. 33rd Annual Cryptology Conf., Santa Barbara, CA, August 18–22, pp. 353–373. Springer, Berlin, Heidelberg). We also propose an Adjoin construction that is substantially more efficient (in computation and storage) than the previous ones (Mironov, I., Segev, G., and Shahaf, I. (2017) Strengthening the Security of Encrypted Databases: Non-Transitive Joins. 15th Int. Conf., TCC 2017, Baltimore, MD, USA, November 12–15, pp. 631–661. Springer, Cham) due to the hardness assumption that we rely on, while retaining the same security notion. The proposed PSO scheme can be used to perform join queries on encrypted databases without revealing the duplicate patterns in the unqueried columns, which is inherent to an Adjoin scheme. In addition, we show that the PSO scheme can be used to perform Boolean search queries on a collection of encrypted documents. We also provide standard security proofs for our constructions, present a detailed efficiency evaluation, and compare them with well-known previous ones.



