CORRELATIONS AND QUERY PROCESSING

2020 ◽  
Vol 8 (9) ◽  
pp. 811-816
Author(s):  
Bhanu Shanker Prasad

It is known that optimizing join queries based on average selectivities is sub-optimal for highly correlated databases. In such databases, relations are naturally divided into partitions, each with substantially different statistical characteristics. It is therefore compelling to discover such data partitions during query optimization and to create multiple plans for a given query, each plan being optimal for a particular combination of data partitions. This scenario calls for sharing state among plans, so that common intermediate results are not recomputed. We study this problem in a setting with a routing-based query execution engine based on eddies. Eddies naturally encapsulate horizontal partitioning and maximal state sharing across multiple plans. The purpose of this paper is to demonstrate faster execution times than traditional optimization under high correlations, while maintaining the same performance under low correlations.
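To make the routing idea concrete, the following minimal Python sketch (not the paper's implementation; the relation contents, partition labels, and per-partition routing table are invented for illustration) sends tuples from different partitions through different join orders while probing the same shared build-side hash tables, so intermediate state is never duplicated across plans.

# Minimal sketch of partition-aware, eddy-style routing with shared join state.
# All data and routing choices below are illustrative assumptions.

# Hypothetical per-partition plans: for partition "p1" join S first,
# for partition "p2" the join with T is assumed to be more selective.
ROUTE = {"p1": ["S", "T"], "p2": ["T", "S"]}

S = {1: "s1", 2: "s2"}          # toy build-side: key -> payload
T = {2: "t2", 3: "t3"}
TABLES = {"S": S, "T": T}

# Shared state: one hash table per join operator, reused by every route.
shared_state = {name: table for name, table in TABLES.items()}

def eddy(tuples):
    """Route each R-tuple through the join order chosen for its partition."""
    results = []
    for partition, key in tuples:
        row = (key,)
        alive = True
        for op in ROUTE[partition]:            # partition-specific plan
            match = shared_state[op].get(key)  # probe the shared hash table
            if match is None:
                alive = False                  # tuple eliminated early
                break
            row = row + (match,)
        if alive:
            results.append(row)
    return results

if __name__ == "__main__":
    r_tuples = [("p1", 1), ("p1", 2), ("p2", 2), ("p2", 3)]
    print(eddy(r_tuples))   # only key 2 survives both joins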

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-25
Author(s):  
Xiao-Yan Gao ◽  
Radhya Sahal ◽  
Gui-Xiu Chen ◽  
Mohammed H. Khafagy ◽  
Fatma A. Omara

Multiway join queries incur high-cost I/O operations over large-scale data. Exploiting sharing opportunities among multiple multiway joins can reduce query execution time and the amount of shuffled intermediate data. Although multiway join optimization has been carried out for MapReduce, the different design principles of in-memory Big Data platforms (e.g., Flink) have not been considered. To bridge this gap, an end-to-end multiway join system over Flink, called the Join-MOTH (J-MOTH) system, is proposed to exploit sharing data granularity, sharing join granularity, and sharing implicit sorts within multiple join queries. For sharing data, our previous work, the Multiquery Optimization using Tuple Size and Histogram (MOTH) system, has been introduced to consider the granularity of data-sharing opportunities among multiple queries. For sharing sorts, our previous work, the Sort-Based Optimizer for Big Data Multiquery (SOOM), has been introduced to consider the implicit sorts among join queries. For sharing joins, additional modules have been added to the J-MOTH optimizer to exploit shared pipelined multiway joins among multiple multiway join queries. The experimental evaluation demonstrates that the J-MOTH system outperforms the naive and state-of-the-art techniques by 44% in query execution time using TPC-H queries. In addition, the proposed J-MOTH system reduces intermediate data size by 30% on average over Hadoop-like infrastructures.
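As a rough illustration of the sharing-join idea (this is not the J-MOTH algorithm; the query shapes and relation names below are hypothetical), the following Python sketch enumerates the pairwise joins that occur in several multiway join queries. Joins shared by more than one query are the candidates to be executed once and pipelined to all consumers.

# Minimal sketch of detecting sharing-join opportunities among multiway joins.
from collections import defaultdict
from itertools import combinations

# Each query is represented by the set of relations it joins.
queries = {
    "Q1": {"lineitem", "orders", "customer"},
    "Q2": {"lineitem", "orders", "supplier"},
    "Q3": {"lineitem", "orders", "customer", "nation"},
}

def shared_joins(queries):
    """Return each pairwise join together with the queries that could reuse it."""
    counts = defaultdict(list)
    for name, rels in queries.items():
        for pair in combinations(sorted(rels), 2):
            counts[pair].append(name)
    # Keep only joins shared by at least two queries.
    return {pair: qs for pair, qs in counts.items() if len(qs) > 1}

if __name__ == "__main__":
    for pair, qs in shared_joins(queries).items():
        print(f"join {pair} shared by {qs}")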


2012 ◽  
Vol 433-440 ◽  
pp. 3335-3339
Author(s):  
Bo Zhu Wu

Based on an in-depth study of existing distributed database query processing techniques, this paper proposes a distributed database query processing scheme. The scheme optimizes existing query processing by storing commonly used query results according to query frequency, so that they can be reused directly by subsequent queries or serve as intermediate results. This avoids the transmission of large amounts of data, thereby reducing query time and improving query efficiency.
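A minimal sketch of the idea (the caching policy, capacity, and API below are assumptions, not the paper's design): a node counts how often each query is issued and keeps the result sets of frequently issued queries locally, so later queries reuse them instead of re-shipping data from remote sites.

# Frequency-based result cache for a distributed query node (illustrative only).
from collections import Counter

class FrequencyResultCache:
    def __init__(self, capacity=2, threshold=2):
        self.capacity = capacity      # max number of cached result sets
        self.threshold = threshold    # cache a query once it is this frequent
        self.freq = Counter()         # how often each query was issued
        self.cache = {}               # query text -> result rows

    def run(self, query, execute_remote):
        """Return cached rows if present, otherwise execute and maybe cache."""
        self.freq[query] += 1
        if query in self.cache:
            return self.cache[query]            # no data transfer needed
        rows = execute_remote(query)            # costly distributed execution
        if self.freq[query] >= self.threshold and len(self.cache) < self.capacity:
            self.cache[query] = rows            # keep hot results locally
        return rows

if __name__ == "__main__":
    cache = FrequencyResultCache()
    remote = lambda q: [("row-for", q)]         # stand-in for remote sites
    for q in ["SELECT * FROM t1", "SELECT * FROM t1", "SELECT * FROM t1"]:
        print(cache.run(q, remote))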


2014 ◽  
Vol 7 (14) ◽  
pp. 1857-1868
Author(s):  
Wentao Wu ◽  
Xi Wu ◽  
Hakan Hacigümüş ◽  
Jeffrey F. Naughton

Author(s):  
Wei Yan

In order to solve the problem of storing and querying massive XML data, a method for efficient storage and parallel querying of massive XML data with Hadoop is proposed. With this method, massive XML data is divided into many XML data blocks and loaded onto HDFS. A parallel query method for massive XML data is proposed, which uses parallel XPath queries based on multiple predicate selection, and the results of the parallel query satisfy the query requirements given by the user. In this chapter, map-logic and reduce-logic algorithms for parallel XPath queries using the MapReduce programming model are proposed, and parallel query processing of massive XML data is realized. In addition, a MapReduce query optimization method based on multiple predicate selection is proposed to reduce the data transfer volume of the system and improve system performance. Finally, the effectiveness of the proposed method is verified experimentally.
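A minimal Python sketch of the map/reduce style described above (the XML shape, block contents, and predicate are invented; this is not the chapter's algorithm): each mapper evaluates an XPath predicate locally on its XML block so that only matching records are shuffled, and the reducer merges the partial results.

# Map-logic and reduce-logic for predicate-pushdown XPath over XML blocks.
import xml.etree.ElementTree as ET
from itertools import groupby

# Each "block" stands for one XML split stored on HDFS.
blocks = [
    "<books><book year='2012'><title>A</title></book></books>",
    "<books><book year='2015'><title>B</title></book></books>",
]

def map_block(block):
    """Map logic: apply the predicate selection inside the block."""
    root = ET.fromstring(block)
    for book in root.findall("./book[@year='2015']"):   # predicate pushdown
        yield ("book", book.findtext("title"))

def reduce_results(pairs):
    """Reduce logic: group the matching records emitted by the mappers."""
    pairs = sorted(pairs)                                # stand-in for shuffle
    return {k: [v for _, v in grp] for k, grp in groupby(pairs, key=lambda p: p[0])}

if __name__ == "__main__":
    mapped = [pair for block in blocks for pair in map_block(block)]
    print(reduce_results(mapped))      # {'book': ['B']}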


Author(s):  
Ladjel Bellatreche

Decision support applications require complex queries, e.g., multi-way joins defined over huge warehouses that are usually modelled using star schemas, i.e., a fact table and a set of dimension tables (Papadomanolakis & Ailamaki, 2004). Star schemas have an important property with respect to join operations between dimension tables and the fact table: the fact table contains a foreign key for each dimension, and there are no join operations between dimension tables. Joins in data warehouses (called star join queries) are particularly expensive because the fact table (by far the largest table in the warehouse) participates in every join, and multiple dimensions are likely to participate in each join. To speed up star join queries, many optimization structures have been proposed: redundant structures (materialized views and advanced index schemes) and non-redundant structures (data partitioning and parallel processing).

Data partitioning has recently been recognized as an important aspect of physical database design (Sanjay, Narasayya & Yang, 2004; Papadomanolakis & Ailamaki, 2004). Two types of data partitioning are available (Özsu & Valduriez, 1999): vertical and horizontal partitioning. Vertical partitioning decomposes tables into disjoint sets of columns. Horizontal partitioning decomposes tables, materialized views and indexes into disjoint sets of rows that are physically stored and usually accessed separately. Unlike redundant structures, data partitioning does not replicate data, thereby reducing storage requirements and minimizing maintenance overhead. In this paper, we concentrate only on horizontal data partitioning (HP). HP can positively affect (1) query performance, through partition elimination: if a query includes a partition key as a predicate in the WHERE clause, the query optimizer will automatically route the query only to the relevant partitions; and (2) database manageability, for instance by allocating partitions to different machines or by splitting any access path: tables, materialized views, indexes, etc.

Most database systems provide three methods to perform HP using the PARTITION statement: RANGE, HASH and LIST (Sanjay, Narasayya & Yang, 2004). In range partitioning, an access path (table, view, or index) is split according to a range of values of a given set of columns. The hash mode decomposes the data according to a hash function (provided by the system) applied to the values of the partitioning columns. List partitioning splits a table according to the listed values of a column. These methods can be combined to generate composite partitioning. Oracle currently supports range-hash and range-list composite partitioning using the PARTITION - SUBPARTITION statement. The following SQL statement shows an example of fragmenting a table Student using range partitioning.
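A representative statement (Oracle syntax; the column names and partition bounds are assumptions) is:

-- Range partitioning of the Student table on the enrollment year
-- (column names and bounds are illustrative assumptions).
CREATE TABLE Student (
    student_id   NUMBER,
    name         VARCHAR2(100),
    enroll_year  NUMBER
)
PARTITION BY RANGE (enroll_year) (
    PARTITION p_before_2000 VALUES LESS THAN (2000),
    PARTITION p_2000_2005   VALUES LESS THAN (2006),
    PARTITION p_recent      VALUES LESS THAN (MAXVALUE)
);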


Author(s):  
Mingzhu Wei ◽  
Ming Li ◽  
Elke A. Rundensteiner ◽  
Murali Mani ◽  
Hong Su

Stream applications bring the challenge of efficiently processing queries on sequentially accessible XML data streams. In this chapter, the authors study the current techniques and open challenges of XML stream processing. Firstly, they examine the input data semantics of XML streams and introduce the state of the art in XML stream processing. Secondly, they compare and contrast the automaton-based and algebra-based techniques used in XML stream query execution. Thirdly, they study different optimization strategies that have been investigated for XML stream processing; in particular, they discuss cost-based as well as schema-based optimization strategies. Last but not least, the authors list several key open challenges in XML stream processing.


2013 ◽  
Vol 441 ◽  
pp. 970-973
Author(s):  
Yan Qin Zhang ◽  
Jing Bin Wang

With the development of the Semantic Web, RDF data sets have grown rapidly, raising the problem of querying massive RDF data. Using distributed techniques to evaluate SPARQL (SPARQL Protocol and RDF Query Language) queries is a new way of addressing this problem. At present, most Hadoop-based RDF query strategies have to use multiple MapReduce jobs to complete a task, which wastes time. To overcome this drawback, the MRQJ (using MapReduce to query and join) algorithm is proposed in this paper: it first uses a greedy strategy to generate a join plan, and then creates only one MapReduce job to obtain the query results during SPARQL query execution. Finally, a comparative experiment on the LUBM (Lehigh University Benchmark) data set is conducted; the results show that the MRQJ method has a clear advantage for more complicated queries.
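A minimal sketch of the greedy join-plan idea (the triple patterns and cardinality estimates are invented, and this is not the MRQJ implementation): order the triple patterns of a SPARQL query so that the most selective pattern is joined first and each subsequent pattern is connected to the partial plan, producing one join order that a single MapReduce job could evaluate.

# Greedy join-plan generation over SPARQL triple patterns (illustrative only).

# Each pattern is (subject, predicate, object); '?x' marks a variable.
patterns = [
    ("?s", "rdf:type", "ub:Student"),
    ("?s", "ub:takesCourse", "?c"),
    ("?c", "ub:name", '"Course42"'),
]

# Hypothetical cardinality estimates per predicate (smaller = more selective).
card = {"rdf:type": 100_000, "ub:takesCourse": 500_000, "ub:name": 50}

def greedy_join_plan(patterns):
    """Greedily pick the cheapest pattern that shares a variable with the plan."""
    remaining = sorted(patterns, key=lambda p: card[p[1]])
    plan = [remaining.pop(0)]
    bound = {t for t in plan[0] if t.startswith("?")}
    while remaining:
        # Prefer patterns already connected to the partial plan via a variable.
        connected = [p for p in remaining
                     if any(t in bound for t in p if t.startswith("?"))]
        nxt = min(connected or remaining, key=lambda p: card[p[1]])
        remaining.remove(nxt)
        plan.append(nxt)
        bound |= {t for t in nxt if t.startswith("?")}
    return plan

if __name__ == "__main__":
    for step, p in enumerate(greedy_join_plan(patterns), 1):
        print(step, p)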

