Primary and Referential Horizontal Partitioning Selection Problems

Author(s):  
Ladjel Bellatreche ◽  
Kamel Boukhalfa ◽  
Pascal Richard

Horizontal partitioning has evolved significantly in recent years and is widely advocated by the academic and industrial communities. It positively affects query performance, database manageability, and availability. Two types of horizontal partitioning are supported: primary and referential. In the context of relational data warehouses, horizontal fragmentation consists of partitioning the dimension tables by primary fragmentation and then fragmenting the fact table by referential fragmentation. This fragmentation can generate a very large number of fragments, which may make the maintenance task very complicated. In this paper, we first focus on the evolution of horizontal partitioning in commercial DBMSs, motivated by decision support applications. Secondly, we formalize the referential fragmentation schema selection problem in the data warehouse and study the hardness of selecting an optimal solution. Due to this high complexity, we develop two algorithms, hill climbing and simulated annealing, with several variants to select a near-optimal partitioning schema. We present ParAdmin, an advisor tool that assists administrators in using primary and referential partitioning during the physical design of their data warehouses. Finally, extensive experimental studies are conducted using the data set of the APB1 benchmark to compare the quality of the proposed algorithms using a mathematical cost model. Based on these experiments, some recommendations are given to ensure the effective use of horizontal partitioning.
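
As a rough illustration of the hill-climbing approach described above, the following Python sketch greedily improves a candidate fragmentation schema under a cost model. The schema encoding, the neighborhood moves, and the toy cost function are simplifying assumptions for illustration, not the authors' actual formulation.

```python
def hill_climb(initial_schema, neighbors, cost, max_iters=1000):
    """Greedy hill climbing over candidate fragmentation schemas.

    initial_schema -- any encoding of a partitioning schema
    neighbors      -- function returning schemas one move away
                      (e.g., merging or splitting attribute sub-domains)
    cost           -- estimated workload cost of a schema (lower is better)
    """
    current, current_cost = initial_schema, cost(initial_schema)
    for _ in range(max_iters):
        candidates = neighbors(current)
        if not candidates:
            break
        best = min(candidates, key=cost)
        if cost(best) >= current_cost:      # local optimum reached
            break
        current, current_cost = best, cost(best)
    return current, current_cost

# Toy example: a schema is a tuple of fragment counts, one per dimension
# table; under referential partitioning the fact-table fragment count is
# the product of these. The illustrative cost trades scan cost against a
# maintenance penalty once fragments exceed a threshold W.
W = 100
def toy_cost(schema):
    fact_fragments = 1
    for k in schema:
        fact_fragments *= k
    return 1_000_000 / fact_fragments + 50 * max(0, fact_fragments - W)

def toy_neighbors(schema):
    return [schema[:i] + (k,) + schema[i + 1:]
            for i in range(len(schema))
            for k in (schema[i] + 1, schema[i] - 1) if k >= 1]

print(hill_climb((1, 1, 1), toy_neighbors, toy_cost))
```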

2009 ◽  
Vol 5 (4) ◽  
pp. 1-23 ◽  
Author(s):  
Ladjel Bellatreche ◽  
Kamel Boukhalfa ◽  
Pascal Richard ◽  
Komla Yamavo Woameno

Horizontal partitioning has been widely adopted by the database community, where it plays a significant part in the physical design process. It is now supported by most commercial database systems (DBMSs), which offer a native Data Definition Language for decomposing tables/materialized views using various modes. In traditional databases, horizontal partitioning has been studied extensively, and several fragmentation algorithms were proposed to partition tables in isolation. In the relational data warehouse environment, horizontal partitioning consists of decomposing the whole warehouse schema into sub-schemas, where each sub-schema contains fragments of dimension and fact tables. Dimension tables are fragmented using the primary partitioning mode, whereas the fact table is divided using the referential mode. In this article, the authors first focus on the evolution of horizontal partitioning in commercial DBMSs, motivated by decision support applications. Secondly, they formalize the referential fragmentation schema selection problem in the data warehouse and study the hardness of selecting an optimal solution. Due to its high complexity, they develop two algorithms, hill climbing and simulated annealing, with several variants to select a near-optimal partitioning schema. Finally, extensive experimental studies are conducted using the data set of the APB1 benchmark to compare the quality of the proposed algorithms using a mathematical cost model. Based on these experiments, some recommendations are given to help database administrators use horizontal partitioning effectively.
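
The simulated annealing alternative can be sketched along the same lines. The Python sketch below is generic, assuming only a user-supplied random-neighbor move and cost model; the authors' actual variants differ in their neighborhood definitions and cooling schedules.

```python
import math
import random

def simulated_annealing(schema, random_neighbor, cost,
                        t0=1000.0, cooling=0.95, steps_per_t=20, t_min=1e-3):
    """Simulated annealing over fragmentation schemas.

    Unlike hill climbing, a worse neighbor is accepted with probability
    exp(-delta / T), letting the search escape local optima of the
    schema-selection cost model.
    """
    current, current_cost = schema, cost(schema)
    best, best_cost = current, current_cost
    t = t0
    while t > t_min:
        for _ in range(steps_per_t):
            cand = random_neighbor(current)
            delta = cost(cand) - current_cost
            if delta < 0 or random.random() < math.exp(-delta / t):
                current, current_cost = cand, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= cooling            # geometric cooling schedule
    return best, best_cost
```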


2011 ◽  
Vol 1 (3) ◽  
pp. 32-46 ◽  
Author(s):  
Minghuang Li ◽  
Fusheng Yu

Building a linear fitting model for a given interval-valued data set is challenging since minimizing the residue function leads to a huge combinatorial problem. To overcome this difficulty, this article proposes a new semidefinite programming-based method for fitting a linear model to interval-valued data. First, the fitting model is cast as a quadratically constrained quadratic program (QCQP), and then two formulae are derived to develop a lower bound on the optimal value of the nonconvex QCQP via semidefinite relaxation and Lagrangian relaxation. In many cases, this method solves the fitting problem exactly. Even when the lower bound is not the optimal value, it is still a good approximation of the global optimum. Experimental studies on fitting problems of different scales demonstrate the good performance and stability of the method, which also performs very well on relatively large-scale interval-fitting problems.
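
For readers unfamiliar with the relaxation step, the sketch below shows a generic Shor semidefinite relaxation of a nonconvex QCQP in Python with cvxpy. It illustrates the lower-bounding idea only; the article's two specific formulae and the interval-fitting objective are not reproduced here.

```python
import numpy as np
import cvxpy as cp

def shor_lower_bound(P0, q0, constraints):
    """Shor semidefinite relaxation of a nonconvex QCQP:

        minimize    x' P0 x + q0' x
        subject to  x' Pi x + qi' x + ri <= 0  for (Pi, qi, ri) in constraints

    Lifting X = x x' and relaxing to X >> x x' (a PSD condition on a
    bordered matrix) yields a convex SDP whose optimal value lower-bounds
    the original problem, and matches it when the relaxation is tight.
    """
    n = q0.size
    X = cp.Variable((n, n), symmetric=True)
    x = cp.Variable(n)
    M = cp.bmat([[X, cp.reshape(x, (n, 1))],
                 [cp.reshape(x, (1, n)), np.ones((1, 1))]])
    cons = [M >> 0]
    for Pi, qi, ri in constraints:
        cons.append(cp.trace(Pi @ X) + qi @ x + ri <= 0)
    prob = cp.Problem(cp.Minimize(cp.trace(P0 @ X) + q0 @ x), cons)
    prob.solve()
    return prob.value, x.value
```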


2012 ◽  
Vol 23 (4) ◽  
pp. 17-51 ◽  
Author(s):  
Ladjel Bellatreche ◽  
Alfredo Cuzzocrea ◽  
Soumia Benkrid

In this paper, a comprehensive methodology for designing and querying Parallel Relational Data Warehouses (PRDW) over database clusters, called Fragmentation & Allocation (F&A), is proposed. F&A assumes that cluster nodes are heterogeneous in processing power and storage capacity, contrary to traditional design approaches that assume homogeneous nodes, and it performs the fragmentation and allocation phases simultaneously. Classical approaches use two different cost models to perform fragmentation and allocation separately, whereas F&A uses a single cost model that considers fragmentation and allocation parameters together; the allocation decision is therefore made at fragmentation time. At the fragmentation phase, F&A uses two well-known algorithms, Hill Climbing (HC) and a Genetic Algorithm (GA), which the authors adapt to the PRDW design problem over heterogeneous database clusters, as these algorithms can take into account the heterogeneous characteristics of the reference application scenario. At the allocation phase, F&A introduces an innovative matrix-based formalism capable of capturing the interactions among fragments, input queries, and cluster node characteristics, driving the data allocation task accordingly, together with a related affinity-based algorithm called F&A-ALLOC. Finally, the proposal is experimentally assessed and validated against the widely known data warehouse benchmark APB-1 release II.
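
Since F&A-ALLOC itself is not reproduced here, the following Python sketch only illustrates the flavor of matrix-based, affinity-aware allocation over heterogeneous nodes: a query-fragment usage matrix yields fragment co-access counts, and fragments are then assigned greedily to the node with the lowest load per unit of processing power that still has storage capacity. All names and the greedy rule are assumptions for illustration.

```python
import numpy as np

def allocate(usage, fragment_sizes, node_power, node_capacity):
    """Greedy affinity-aware fragment allocation (illustrative only).

    usage[q, f] = 1 if query q accesses fragment f. Fragments that are
    co-accessed by many queries are placed first; each goes to the node
    with the lowest load per unit of power that can still store it.
    """
    affinity = usage.T @ usage                 # fragment co-access counts
    order = np.argsort(-affinity.sum(axis=1))  # most-shared fragments first
    load = np.zeros(len(node_power))
    used = np.zeros(len(node_power))
    assign = {}
    for f in map(int, order):
        best, best_score = None, None
        for n in range(len(node_power)):
            if used[n] + fragment_sizes[f] > node_capacity[n]:
                continue
            score = (load[n] + 1) / node_power[n]
            if best is None or score < best_score:
                best, best_score = n, score
        if best is None:
            raise ValueError(f"no node can hold fragment {f}")
        assign[f] = best
        load[best] += usage[:, f].sum()
        used[best] += fragment_sizes[f]
    return assign

usage = np.array([[1, 1, 0],
                  [0, 1, 1]])                  # 2 queries x 3 fragments
print(allocate(usage, fragment_sizes=[10, 40, 25],
               node_power=[1.0, 2.0], node_capacity=[50, 60]))
```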


2008 ◽  
pp. 718-737
Author(s):  
Pedro Furtado

Data Warehouses (DWs) with large quantities of data present major performance and scalability challenges, and parallelism can be used for major performance improvement in this context. However, instead of costly specialized parallel hardware and interconnections, we focus on low-cost standard computing nodes, possibly in a non-dedicated local network. In this environment, special care must be taken with partitioning and processing. We use experimental evidence to analyze the shortcomings of a basic horizontal partitioning strategy designed for that environment, and then propose and test improvements that allow efficient placement for the low-cost Node Partitioned Data Warehouse. We show experimentally that extra overheads related to processing large replicated relations and repartitioning requirements between nodes can significantly degrade speedup performance for many query patterns. We analyze a simple, easy-to-apply partitioning and placement decision that achieves good performance improvements. Our experiments and discussion provide important insight into partitioning and processing issues for data warehouses in shared-nothing environments.
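
A minimal sketch of the kind of placement decision discussed above, assuming a simple size threshold: small relations are replicated on every node so joins stay local, while the fact table and large dimensions are hash-partitioned on a shared join key to avoid repartitioning between nodes. The threshold, schema, and rule are illustrative assumptions, not Furtado's exact algorithm.

```python
def place(relations, replicate_threshold_mb=100):
    """Decide, per relation, between full replication and hash
    partitioning on its join key (illustrative sketch)."""
    plan = {}
    for name, (size_mb, join_key) in relations.items():
        if size_mb <= replicate_threshold_mb:
            plan[name] = ("replicate", None)       # local joins everywhere
        else:
            plan[name] = ("hash-partition", join_key)
    return plan

schema = {
    "sales":    (50_000, "customer_id"),   # fact table
    "customer": (2_000,  "customer_id"),   # large dimension, co-partitioned
    "product":  (20,     "product_id"),    # small dimensions -> replicate
    "time":     (1,      "time_id"),
}
print(place(schema))
```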


2019 ◽  
Vol 15 (3) ◽  
pp. 46-62
Author(s):  
Canan Eren Atay ◽  
Georgia Garani

A data warehouse is considered a key aspect of success for any decision support system. Research on temporal databases has produced important results in this field, and data warehouses, which store historical data, can clearly benefit from such studies. A slowly changing dimension is a data warehouse dimension in which attributes can change infrequently over time. Although different solutions have been proposed, each has its own particular disadvantages. The authors propose the Object-Relational Temporal Data Warehouse (O-RTDW) model for slowly changing dimensions in this research work. Using this approach, it is possible to keep track of the whole history of an object in a data warehouse efficiently. The proposed model has been implemented on a real data set and tested successfully. Several limitations of other solutions, such as redundancy, surrogate keys, incomplete historical data, and the creation of additional tables, are not present in our solution.
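
The nub of the O-RTDW idea, keeping an object's entire attribute history nested inside the dimension member rather than in extra rows or tables, can be sketched in Python as below. The paper's model is object-relational (nested tables); this plain-Python analogue, with its hypothetical names, is only meant to show how history tracking works without surrogate keys or additional tables.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class AttributeVersion:
    value: str
    valid_from: date
    valid_to: Optional[date] = None          # None = current version

@dataclass
class TemporalDimensionMember:
    """Dimension member nesting the full history of a slowly changing
    attribute (illustrative analogue of an object-relational nested table)."""
    key: int
    history: list = field(default_factory=list)

    def update(self, value: str, when: date):
        if self.history:
            self.history[-1].valid_to = when  # close the current version
        self.history.append(AttributeVersion(value, when))

    def value_at(self, when: date) -> Optional[str]:
        for v in self.history:
            if v.valid_from <= when and (v.valid_to is None or when < v.valid_to):
                return v.value
        return None

cust = TemporalDimensionMember(key=42)
cust.update("Bronze", date(2018, 1, 1))
cust.update("Gold", date(2019, 6, 1))
print(cust.value_at(date(2018, 12, 31)))      # -> Bronze
```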



2021 ◽  
pp. 014544552110540
Author(s):  
Nihal Sen

The purpose of this study is to provide a brief introduction to effect size calculation in single-subject design studies, including a description of nonparametric and regression-based effect sizes. We then focus the rest of the tutorial on common regression-based methods used to calculate effect size in single-subject experimental studies. We first describe the differences among five regression-based methods (Gorsuch; White et al.; Center et al.; Allison and Gorman; Huitema and McKean). This is followed by an example applying the five regression-based effect size methods to a sample data set, demonstrating how the values obtained from the different methods differ. We show the specific regression models used in these five methods and how they can be obtained from the SPSS program. The R² values obtained from these five methods were converted to Cohen's d values and compared. The d values obtained from the same data set were 0.003, 0.357, 2.180, 3.470, and 2.108 for the Allison and Gorman, Gorsuch, White et al., Center et al., and Huitema and McKean methods, respectively. A brief description of selected statistical programs available for conducting regression-based methods is also given.
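
To make the regression-based idea concrete, the sketch below fits a phase + trend + phase-by-trend model to a toy AB data set and converts R² to a d-type value. It follows the general spirit of the Center et al. piecewise model, with simplified, assumed details; the five methods in the tutorial differ precisely in which terms enter the regression.

```python
import numpy as np

def regression_effect_size(y, n_baseline):
    """R-squared from a phase + trend + phase-by-trend regression on an
    AB single-subject data set, converted to a Cohen's-d-type value
    (simplified sketch in the spirit of the Center et al. model)."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))                        # overall trend
    phase = (t >= n_baseline).astype(float)      # 0 = baseline, 1 = treatment
    X = np.column_stack([np.ones_like(t), t, phase, phase * (t - n_baseline)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1 - (y - X @ beta).var() / y.var()
    # One common conversion: r = sqrt(R^2), d = 2r / sqrt(1 - r^2)
    r = np.sqrt(r2)
    return r2, 2 * r / np.sqrt(1 - r**2)

baseline = [3, 4, 3, 5, 4]
treatment = [7, 8, 9, 8, 10]
print(regression_effect_size(baseline + treatment, n_baseline=5))
```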


2021 ◽  
pp. 36-43
Author(s):  
L. A. Demidova ◽  
A. V. Filatov

The article considers an approach to the problem of monitoring and classifying the states of hard disks, a task performed on a regular basis within the framework of non-destructive testing. The problem is solved by developing a classification model using machine learning algorithms, in particular recurrent neural networks with the Simple RNN, LSTM, and GRU architectures. To develop the classification model, a data set based on the values of SMART sensors installed on hard disks is used; it represents a group of multidimensional time series. The classification model contains two recurrent neural network layers with one of these architectures, followed by a Dropout layer and a Dense layer. The results of experimental studies confirming the advantages of the LSTM and GRU architectures in hard disk state classification models are presented.
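
The described two-recurrent-layer structure maps directly onto a Keras model, as in the hedged sketch below; the layer sizes, dropout rate, and optimizer are assumptions, and `cell` can be swapped among SimpleRNN, LSTM, and GRU to reproduce the architecture comparison.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(timesteps, n_features, n_classes, cell=layers.LSTM):
    """Two recurrent layers + Dropout + Dense, following the structure
    described in the article (hyperparameters are assumed; pass
    cell=layers.SimpleRNN or cell=layers.GRU to compare architectures)."""
    model = models.Sequential([
        layers.Input(shape=(timesteps, n_features)),  # SMART time series
        cell(64, return_sequences=True),              # first recurrent layer
        cell(32),                                     # second recurrent layer
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_classifier(timesteps=30, n_features=12, n_classes=2)
model.summary()
```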


2007 ◽  
Vol 100 (1) ◽  
pp. 294-302 ◽  
Author(s):  
Elizabeth H. Chaney ◽  
J. Don Chaney ◽  
Min Qi Wang ◽  
James M. Eddy

The purpose of this study was to test the hypothesis that individuals reporting healthy lifestyle behaviors would also report better self-rated mental health. Logistic regression analyses were conducted using SUDAAN on the Behavioral Risk Factor Surveillance System data set. This descriptive analysis suggests that persons reporting poor mental health were more likely to report unhealthy lifestyle behaviors. These findings encourage careful design of experimental studies of the empirically based associations between mental health and lifestyle, using psychometrically sound measures; public health programs focused on changing health-related behaviors could then be more suitably devised.
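
For readers without SUDAAN, a weighted logistic regression of the same general shape can be sketched in Python with statsmodels; note that this only applies survey weights and, unlike SUDAAN, ignores the complex survey design (strata, clusters). The data below are simulated stand-ins, not BRFSS values.

```python
import numpy as np
import statsmodels.api as sm

# Simulated BRFSS-style extract: outcome = poor self-rated mental health,
# predictors = unhealthy-lifestyle indicators (all values hypothetical).
rng = np.random.default_rng(0)
n = 1000
smoker = rng.integers(0, 2, n)
no_exercise = rng.integers(0, 2, n)
logit = -1.5 + 0.6 * smoker + 0.5 * no_exercise
poor_mh = rng.binomial(1, 1 / (1 + np.exp(-logit)))
weights = rng.uniform(0.5, 2.0, n)               # simulated survey weights

X = sm.add_constant(np.column_stack([smoker, no_exercise]))
result = sm.GLM(poor_mh, X, family=sm.families.Binomial(),
                freq_weights=weights).fit()
print(result.summary())
```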


2010 ◽  
Vol 2 (1) ◽  
pp. 99-116
Author(s):  
Katarzyna Rostek

Data Analytical Processing in Data Warehouses

The article presents issues connected with processing information from data warehouses (the analytical enterprise databases) and the two basic types of analytical data processing in a data warehouse. For each type of analysis, the genesis, main definitions, scope of application, and real examples from business implementations are described. The author's method of knowledge discovery in databases is also presented, together with practical guidelines for its proper and effective use in the enterprise.

