Bitmap Join Indexes vs. Data Partitioning

Author(s):  
Ladjel Bellatreche

Scientific databases and data warehouses store large amounts of data ith several tables and attributes. For instance, the Sloan Digital Sky Survey (SDSS) astronomical database contains a large number of tables with hundreds of attributes, which can be queried in various combinations (Papadomanolakis & Ailamaki, 2004). These queries involve many tables using binary operations, such as joins. To speed up these queries, many optimization structures were proposed that can be divided into two main categories: redundant structures like materialized views, advanced indexing schemes (bitmap, bitmap join indexes, etc.) (Sanjay, Chaudhuri & Narasayya, 2000) and vertical partitioning (Sanjay, Narasayya & Yang 2004) and non redundant structures like horizontal partitioning (Sanjay, Narasayya & Yang 2004; Bellatreche, Boukhalfa & Mohania, 2007) and parallel processing (Datta, Moon, & Thomas, 2000; Stöhr, Märtens & Rahm, 2000). These optimization techniques are used either in a sequential manner ou combined. These combinations are done intra-structures: materialized views and indexes for redundant and partitioning and data parallel processing for no redundant. Materialized views and indexes compete for the same resource representing storage, and incur maintenance overhead in the presence of updates (Sanjay, Chaudhuri & Narasayya, 2000). None work addresses the problem of selecting combined optimization structures. In this paper, we propose two approaches; one for combining a non redundant structures horizontal partitioning and a redundant structure bitmap indexes in order to reduce the query processing and reduce the maintenance overhead, and another to exploit algorithms for vertical partitioning to generate bitmap join indexes. To facilitate the understanding of our approaches, for review these techniques in details.

2009 ◽  
pp. 2292-2300
Author(s):  
Ladjel Bellatreche

Scientific databases and data warehouses store large amounts of data ith several tables and attributes. For instance, the Sloan Digital Sky Survey (SDSS) astronomical database contains a large number of tables with hundreds of attributes, which can be queried in various combinations (Papadomanolakis & Ailamaki, 2004). These queries involve many tables using binary operations, such as joins. To speed up these queries, many optimization structures were proposed that can be divided into two main categories: redundant structures like materialized views, advanced indexing schemes (bitmap, bitmap join indexes, etc.) (Sanjay, Chaudhuri & Narasayya, 2000) and vertical partitioning (Sanjay, Narasayya & Yang 2004) and non redundant structures like horizontal partitioning (Sanjay, Narasayya & Yang 2004; Bellatreche, Boukhalfa & Mohania, 2007) and parallel processing (Datta, Moon, & Thomas, 2000; Stöhr, Märtens & Rahm, 2000). These optimization techniques are used either in a sequential manner ou combined. These combinations are done intra-structures: materialized views and indexes for redundant and partitioning and data parallel processing for no redundant. Materialized views and indexes compete for the same resource representing storage, and incur maintenance overhead in the presence of updates (Sanjay, Chaudhuri & Narasayya, 2000). None work addresses the problem of selecting combined optimization structures. In this paper, we propose two approaches; one for combining a non redundant structures horizontal partitioning and a redundant structure bitmap indexes in order to reduce the query processing and reduce the maintenance overhead, and another to exploit algorithms for vertical partitioning to generate bitmap join indexes. To facilitate the understanding of our approaches, for review these techniques in details.


Author(s):  
Ladjel Bellatreche

Decision support applications require complex queries, e.g., multi way joins defining on huge warehouses usually modelled using star schemas, i.e., a fact table and a set of data dimensions (Papadomanolakis & Ailamaki, 2004). Star schemas have an important property in terms of join operations between dimensions tables and the fact table (i.e., the fact table contains foreign keys for each dimension). None join operations between dimension tables. Joins in data warehouses (called star join queries) are particularly expensive because the fact table (the largest table in the warehouse by far) participates in every join and multiple dimensions are likely to participate in each join. To speed up star join queries, many optimization structures were proposed: redundant structures (materialized views and advanced index schemes) and non redundant structures (data partitioning and parallel processing). Recently, data partitioning is known as an important aspect of physical database design (Sanjay, Narasayya & Yang, 2004; Papadomanolakis & Ailamaki, 2004). Two types of data partitioning are available (Özsu & Valduriez, 1999): vertical and horizontal partitioning. Vertical partitioning allows tables to be decomposed into disjoint sets of columns. Horizontal partitioning allows tables, materialized views and indexes to be partitioned into disjoint sets of rows that are physically stored and usually accessed separately. Contrary to redundant structures, data partitioning does not replicate data, thereby reducing storage requirement and minimizing maintenance overhead. In this paper, we concentrate only on horizontal data partitioning (HP). HP may affect positively (1) query performance, by performing partition elimination: if a query includes a partition key as a predicate in the WHERE clause, the query optimizer will automatically route the query to only relevant partitions and (2) database manageability: for instance, by allocating partitions in different machines or by splitting any access paths: tables, materialized views, indexes, etc. Most of database systems allow three methods to perform the HP using PARTITION statement: RANGE, HASH and LIST (Sanjay, Narasayya & Yang, 2004). In the range partitioning, an access path (table, view, and index) is split according to a range of values of a given set of columns. The hash mode decomposes the data according to a hash function (provided by the system) applied to the values of the partitioning columns. The list partitioning splits a table according to the listed values of a column. These methods can be combined to generate composite partitioning. Oracle currently supports range-hash and range-list composite partitioning using PARTITION - SUBPARTITION statement. The following SQL statement shows an example of fragmenting a table Student using range partitioning.


Author(s):  
Ladjel Bellatreche

Horizontal data partitioning is the process of splitting access objects into set of disjoint rows. It was first introduced in the end of 70’s and beginning of the 80’s (Ceri et al., 1982) for logically designing databases in order to improve the query performance by eliminating unnecessary accesses to non-relevant data. It knew a large success (in the beginning of the 80’s) in designing homogeneous distributed databases (Ceri et al., 1982; Ceri et al., 1984; Özsu et al., 1999) and parallel databases (DeWitt et al., 1992; Valduriez, 1993). In distributed environment, horizontal partitioning decomposes global tables into horizontal fragments, where each partition may be spread over multiple nodes. End users at the node can perform local queries/transactions on the partition transparently (the fragmentation of data across multiple sites/processors is not visible to the users.). This increases performance for sites that have regular transactions involving certain views of data, whilst maintaining availability and security. In parallel database context (Rao et al., 2002), horizontal partitioning has been used in order to speed up query performance in a sharednothing parallel database system (DeWitt et al., 1992). This will be done by both intra-query and intra-query parallelisms (Valduriez, 1993). It also facilitates the exploitation of the inputs/outputs bandwidth of the disks by reading and writing data in parallel. In this paper, we use fragmentation and partitioning words interchangeably.


2014 ◽  
Vol 10 (3) ◽  
pp. 34-58 ◽  
Author(s):  
Amira Kerkad ◽  
Ladjel Bellatreche ◽  
Pascal Richard ◽  
Carlos Ordonez ◽  
Dominique Geniet

Analytical queries, like those used in data warehouses and OLAP, are generally interdependent. This is due to the fact that the database is usually modeled with a denormalized star schema or its variants, where most queries pass through a large central fact table. Such interaction has been largely exploited in query optimization techniques such as materialized views. Nevertheless, such approaches usually ignore buffer management and assume queries have a fixed order and are known in advance. We believe such assumptions are too strong and thus they need to be revisited and simplified. In this paper, we study the combination of two problems: buffer management and query scheduling, in both static and dynamic scenarios. We present an NP-hardness study of the joint problem, highlighting its complexity. We then introduce a new and highly efficient algorithm inspired by a beehive. We conduct an extensive experimental evaluation on a real DBMS showing the superiority of our algorithm compared to previous ones as well as its excellent scalability.


Author(s):  
Amin Jarrah ◽  
Bashar Haddad ◽  
Mohammad A. Al-Jarrah ◽  
Muhammad Bassam Obeidat

Evolutionary neural network (ENN) shows high performance in function optimization and in finding approximately global optima from searching large and complex spaces. It is one of the most efficient and adaptive optimization techniques used widely to provide candidate solutions that lead to the fitness of the problem. ENN has the extraordinary ability to search the global and learning the approximate optimal solution regardless of the gradient information of the error functions. However, ENN requires high computation and processing which requires parallel processing platforms such as field programmable gate arrays (FPGAs) and graphic processing units (GPUs) to achieve a good performance. This work involves different new implementations of ENN by exploring and adopting different techniques and opportunities for parallel processing. Different versions of ENN algorithm have also been implemented and parallelized on FPGAs platform for low latency by exploiting the parallelism and pipelining approaches. Real data form mass spectrometry data (MSD) application was tested to examine and verify our implementations. This is a very important and extensive computation application which needs to search and find the optimal features (peaks) in MSD in order to distinguish cancer patients from control patients. ENN algorithm is also implemented and parallelized on single core and GPU platforms for comparison purposes. The computation time of our optimized algorithm on FPGA and GPU has been improved by a factor of 6.75 and 6, respectively.


Sign in / Sign up

Export Citation Format

Share Document