CORRELATIONS AND QUERY PROCESSING

2020 ◽  
Vol 8 (9) ◽  
pp. 811-816
Author(s):  
Bhanu Shanker Prasad

It is known that optimizing join queries based on average selectivities is sub-optimal for highly correlated databases. In such databases, relations are naturally divided into partitions, each with substantially different statistical characteristics. It is therefore compelling to discover such data partitions during query optimization and to create multiple plans for a given query, each plan being optimal for a particular combination of data partitions. This scenario calls for sharing state among plans, so that common intermediate results are not recomputed. We study this problem in a setting with a routing-based query execution engine based on eddies. Eddies naturally encapsulate horizontal partitioning and maximal state sharing across multiple plans. The purpose of this paper is to demonstrate faster execution times than traditional optimization under high correlations, while maintaining the same performance under low correlations.
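To make the routing idea concrete, the following minimal Python sketch (not the paper's implementation; the relation contents, partition labels, and per-partition routing table are invented for illustration) sends tuples from different partitions through different join orders while probing the same shared build-side hash tables, so intermediate state is never duplicated across plans.

# Minimal sketch of partition-aware, eddy-style routing with shared join state.
# All data and routing choices below are illustrative assumptions.

# Hypothetical per-partition plans: for partition "p1" join S first,
# for partition "p2" the join with T is assumed to be more selective.
ROUTE = {"p1": ["S", "T"], "p2": ["T", "S"]}

S = {1: "s1", 2: "s2"}          # toy build-side: key -> payload
T = {2: "t2", 3: "t3"}
TABLES = {"S": S, "T": T}

# Shared state: one hash table per join operator, reused by every route.
shared_state = {name: table for name, table in TABLES.items()}

def eddy(tuples):
    """Route each R-tuple through the join order chosen for its partition."""
    results = []
    for partition, key in tuples:
        row = (key,)
        alive = True
        for op in ROUTE[partition]:            # partition-specific plan
            match = shared_state[op].get(key)  # probe the shared hash table
            if match is None:
                alive = False                  # tuple eliminated early
                break
            row = row + (match,)
        if alive:
            results.append(row)
    return results

if __name__ == "__main__":
    r_tuples = [("p1", 1), ("p1", 2), ("p2", 2), ("p2", 3)]
    print(eddy(r_tuples))   # only key 2 survives both joins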

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-25
Author(s):  
Xiao-Yan Gao ◽  
Radhya Sahal ◽  
Gui-Xiu Chen ◽  
Mohammed H. Khafagy ◽  
Fatma A. Omara

Multiway join queries incur high-cost I/O operations over large-scale data. Exploiting sharing opportunities among multiple multiway joins can reduce query execution time and the amount of shuffled intermediate data. Although multiway join optimization has been carried out for MapReduce, the different design principles of in-memory Big Data platforms (e.g., Flink) have not been considered. To bridge this gap, an end-to-end multiway join system over Flink, called the Join-MOTH (J-MOTH) system, is proposed to exploit sharing data granularity, sharing join granularity, and sharing implicit sorts within multiple join queries. For sharing data, our previous work, the Multiquery Optimization using Tuple Size and Histogram (MOTH) system, has been introduced to consider the granularity of data-sharing opportunities among multiple queries. For sharing sorts, our previous work, the Sort-Based Optimizer for Big Data Multiquery (SOOM), has been introduced to consider the implicit sorts among join queries. For sharing joins, additional modules have been added to the J-MOTH optimizer to exploit shared pipelined multiway joins among multiple multiway join queries. The experimental evaluation demonstrates that the J-MOTH system outperforms the naive and state-of-the-art techniques by 44% in query execution time using TPC-H queries. In addition, the proposed J-MOTH system reduces intermediate data size by 30% on average over Hadoop-like infrastructures.
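As a rough illustration of the sharing-join idea (this is not the J-MOTH algorithm; the query shapes and relation names below are hypothetical), the following Python sketch enumerates the pairwise joins that occur in several multiway join queries. Joins shared by more than one query are the candidates to be executed once and pipelined to all consumers.

# Minimal sketch of detecting sharing-join opportunities among multiway joins.
from collections import defaultdict
from itertools import combinations

# Each query is represented by the set of relations it joins.
queries = {
    "Q1": {"lineitem", "orders", "customer"},
    "Q2": {"lineitem", "orders", "supplier"},
    "Q3": {"lineitem", "orders", "customer", "nation"},
}

def shared_joins(queries):
    """Return each pairwise join together with the queries that could reuse it."""
    counts = defaultdict(list)
    for name, rels in queries.items():
        for pair in combinations(sorted(rels), 2):
            counts[pair].append(name)
    # Keep only joins shared by at least two queries.
    return {pair: qs for pair, qs in counts.items() if len(qs) > 1}

if __name__ == "__main__":
    for pair, qs in shared_joins(queries).items():
        print(f"join {pair} shared by {qs}")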


2012 ◽  
Vol 433-440 ◽  
pp. 3335-3339
Author(s):  
Bo Zhu Wu

Based on an in-depth study of existing distributed database query processing techniques, this paper proposes a distributed database query processing scheme. The scheme optimizes existing query processing by storing commonly used query results according to query frequency, so that they can be reused directly by subsequent queries or serve as intermediate results. This avoids the transmission of large amounts of data, thereby reducing query time and improving query efficiency.
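A minimal sketch of the idea (the caching policy, capacity, and API below are assumptions, not the paper's design): a node counts how often each query is issued and keeps the result sets of frequently issued queries locally, so later queries reuse them instead of re-shipping data from remote sites.

# Frequency-based result cache for a distributed query node (illustrative only).
from collections import Counter

class FrequencyResultCache:
    def __init__(self, capacity=2, threshold=2):
        self.capacity = capacity      # max number of cached result sets
        self.threshold = threshold    # cache a query once it is this frequent
        self.freq = Counter()         # how often each query was issued
        self.cache = {}               # query text -> result rows

    def run(self, query, execute_remote):
        """Return cached rows if present, otherwise execute and maybe cache."""
        self.freq[query] += 1
        if query in self.cache:
            return self.cache[query]            # no data transfer needed
        rows = execute_remote(query)            # costly distributed execution
        if self.freq[query] >= self.threshold and len(self.cache) < self.capacity:
            self.cache[query] = rows            # keep hot results locally
        return rows

if __name__ == "__main__":
    cache = FrequencyResultCache()
    remote = lambda q: [("row-for", q)]         # stand-in for remote sites
    for q in ["SELECT * FROM t1", "SELECT * FROM t1", "SELECT * FROM t1"]:
        print(cache.run(q, remote))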


2014 ◽  
Vol 7 (14) ◽  
pp. 1857-1868
Author(s):  
Wentao Wu ◽  
Xi Wu ◽  
Hakan Hacigümüş ◽  
Jeffrey F. Naughton

Author(s):  
Wei Yan

In order to solve the problem of storing and querying massive XML data, a method for efficient storage and parallel querying of massive XML data with Hadoop is proposed. With this method, massive XML data is divided into many XML data blocks and loaded onto HDFS. A parallel query method for massive XML data is proposed, which uses parallel XPath queries based on multiple predicate selection, and the results of the parallel query satisfy the query requirements given by the user. In this chapter, map-logic and reduce-logic algorithms for parallel XPath queries using the MapReduce programming model are proposed, and parallel query processing of massive XML data is realized. In addition, a MapReduce query optimization method based on multiple predicate selection is proposed to reduce the data transfer volume of the system and improve system performance. Finally, the effectiveness of the proposed method is verified experimentally.
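A minimal Python sketch of the map/reduce style described above (the XML shape, block contents, and predicate are invented; this is not the chapter's algorithm): each mapper evaluates an XPath predicate locally on its XML block so that only matching records are shuffled, and the reducer merges the partial results.

# Map-logic and reduce-logic for predicate-pushdown XPath over XML blocks.
import xml.etree.ElementTree as ET
from itertools import groupby

# Each "block" stands for one XML split stored on HDFS.
blocks = [
    "<books><book year='2012'><title>A</title></book></books>",
    "<books><book year='2015'><title>B</title></book></books>",
]

def map_block(block):
    """Map logic: apply the predicate selection inside the block."""
    root = ET.fromstring(block)
    for book in root.findall("./book[@year='2015']"):   # predicate pushdown
        yield ("book", book.findtext("title"))

def reduce_results(pairs):
    """Reduce logic: group the matching records emitted by the mappers."""
    pairs = sorted(pairs)                                # stand-in for shuffle
    return {k: [v for _, v in grp] for k, grp in groupby(pairs, key=lambda p: p[0])}

if __name__ == "__main__":
    mapped = [pair for block in blocks for pair in map_block(block)]
    print(reduce_results(mapped))      # {'book': ['B']}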


Author(s):  
Ladjel Bellatreche

Decision support applications require complex queries, e.g., multi-way joins defined over huge warehouses that are usually modelled using star schemas, i.e., a fact table and a set of dimension tables (Papadomanolakis & Ailamaki, 2004). Star schemas have an important property with respect to join operations between dimension tables and the fact table: the fact table contains a foreign key for each dimension, and there are no join operations between dimension tables. Joins in data warehouses (called star join queries) are particularly expensive because the fact table (by far the largest table in the warehouse) participates in every join, and multiple dimensions are likely to participate in each join. To speed up star join queries, many optimization structures have been proposed: redundant structures (materialized views and advanced index schemes) and non-redundant structures (data partitioning and parallel processing).

Data partitioning has recently been recognized as an important aspect of physical database design (Sanjay, Narasayya & Yang, 2004; Papadomanolakis & Ailamaki, 2004). Two types of data partitioning are available (Özsu & Valduriez, 1999): vertical and horizontal partitioning. Vertical partitioning decomposes tables into disjoint sets of columns. Horizontal partitioning decomposes tables, materialized views and indexes into disjoint sets of rows that are physically stored and usually accessed separately. Unlike redundant structures, data partitioning does not replicate data, thereby reducing storage requirements and minimizing maintenance overhead. In this paper, we concentrate only on horizontal data partitioning (HP). HP can positively affect (1) query performance, through partition elimination: if a query includes a partition key as a predicate in the WHERE clause, the query optimizer will automatically route the query only to the relevant partitions; and (2) database manageability, for instance by allocating partitions to different machines or by splitting any access path: tables, materialized views, indexes, etc.

Most database systems provide three methods to perform HP using the PARTITION statement: RANGE, HASH and LIST (Sanjay, Narasayya & Yang, 2004). In range partitioning, an access path (table, view, or index) is split according to a range of values of a given set of columns. The hash mode decomposes the data according to a hash function (provided by the system) applied to the values of the partitioning columns. List partitioning splits a table according to the listed values of a column. These methods can be combined to generate composite partitioning. Oracle currently supports range-hash and range-list composite partitioning using the PARTITION - SUBPARTITION statement. The following SQL statement shows an example of fragmenting a table Student using range partitioning.
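A representative statement (Oracle syntax; the column names and partition bounds are assumptions) is:

-- Range partitioning of the Student table on the enrollment year
-- (column names and bounds are illustrative assumptions).
CREATE TABLE Student (
    student_id   NUMBER,
    name         VARCHAR2(100),
    enroll_year  NUMBER
)
PARTITION BY RANGE (enroll_year) (
    PARTITION p_before_2000 VALUES LESS THAN (2000),
    PARTITION p_2000_2005   VALUES LESS THAN (2006),
    PARTITION p_recent      VALUES LESS THAN (MAXVALUE)
);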


Author(s):  
Mingzhu Wei ◽  
Ming Li ◽  
Elke A. Rundensteiner ◽  
Murali Mani ◽  
Hong Su

Stream applications bring the challenge of efficiently processing queries on sequentially accessible XML data streams. In this chapter, the authors study the current techniques and open challenges of XML stream processing. Firstly, they examine the input data semantics of XML streams and introduce the state of the art in XML stream processing. Secondly, they compare and contrast the automaton-based and algebra-based techniques used in XML stream query execution. Thirdly, they study different optimization strategies that have been investigated for XML stream processing; in particular, they discuss cost-based as well as schema-based optimization strategies. Last but not least, the authors list several key open challenges in XML stream processing.


2013 ◽  
Vol 441 ◽  
pp. 970-973
Author(s):  
Yan Qin Zhang ◽  
Jing Bin Wang

With the development of the Semantic Web, RDF data sets have grown rapidly, raising the problem of querying massive RDF data. Using distributed techniques to evaluate SPARQL (SPARQL Protocol and RDF Query Language) queries is a new way of addressing this problem. At present, most Hadoop-based RDF query strategies have to use multiple MapReduce jobs to complete a task, which wastes time. To overcome this drawback, the MRQJ (using MapReduce to query and join) algorithm is proposed in this paper: it first uses a greedy strategy to generate a join plan, and then creates only one MapReduce job to obtain the query results during SPARQL query execution. Finally, a comparative experiment on the LUBM (Lehigh University Benchmark) data set is conducted; the results show that the MRQJ method has a clear advantage for more complicated queries.
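A minimal sketch of the greedy join-plan idea (the triple patterns and cardinality estimates are invented, and this is not the MRQJ implementation): order the triple patterns of a SPARQL query so that the most selective pattern is joined first and each subsequent pattern is connected to the partial plan, producing one join order that a single MapReduce job could evaluate.

# Greedy join-plan generation over SPARQL triple patterns (illustrative only).

# Each pattern is (subject, predicate, object); '?x' marks a variable.
patterns = [
    ("?s", "rdf:type", "ub:Student"),
    ("?s", "ub:takesCourse", "?c"),
    ("?c", "ub:name", '"Course42"'),
]

# Hypothetical cardinality estimates per predicate (smaller = more selective).
card = {"rdf:type": 100_000, "ub:takesCourse": 500_000, "ub:name": 50}

def greedy_join_plan(patterns):
    """Greedily pick the cheapest pattern that shares a variable with the plan."""
    remaining = sorted(patterns, key=lambda p: card[p[1]])
    plan = [remaining.pop(0)]
    bound = {t for t in plan[0] if t.startswith("?")}
    while remaining:
        # Prefer patterns already connected to the partial plan via a variable.
        connected = [p for p in remaining
                     if any(t in bound for t in p if t.startswith("?"))]
        nxt = min(connected or remaining, key=lambda p: card[p[1]])
        remaining.remove(nxt)
        plan.append(nxt)
        bound |= {t for t in nxt if t.startswith("?")}
    return plan

if __name__ == "__main__":
    for step, p in enumerate(greedy_join_plan(patterns), 1):
        print(step, p)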

