A Methodology Supporting the Design and Evaluating the Final Quality of Data Warehouses

Author(s):  
Maurizio Pighin ◽  
Lucio Ieronutti

The design and configuration of a data warehouse can be difficult tasks, especially in the case of very large databases and in the presence of redundant information. In particular, the choice of which attributes should be treated as dimensions and which as measures is not trivial, and it can heavily influence the effectiveness of the final system. In this article, we propose a methodology aimed at supporting the design process and at deriving information on the overall quality of the final data warehouse. We tested our proposal on three real-world commercial ERP databases.
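As a rough illustration of the kind of attribute analysis such a methodology might rely on, the sketch below scores columns as candidate measures or dimensions from simple statistics (data type and distinct-value ratio). The heuristic, thresholds, and column names are assumptions for illustration only and do not reproduce the authors' metrics.

```python
# Illustrative sketch only: a simple heuristic for flagging candidate
# measures and dimensions from column statistics. The thresholds below
# are assumptions for demonstration, not the paper's actual metrics.
from dataclasses import dataclass

@dataclass
class ColumnStats:
    name: str
    is_numeric: bool
    distinct_ratio: float  # distinct values / total rows

def suggest_role(col: ColumnStats) -> str:
    """Guess whether a column looks like a measure or a dimension."""
    if col.is_numeric and col.distinct_ratio > 0.5:
        return "candidate measure"      # many distinct numeric values
    if col.distinct_ratio < 0.05:
        return "candidate dimension"    # few distinct values, good for grouping
    return "needs manual review"

if __name__ == "__main__":
    cols = [
        ColumnStats("invoice_amount", True, 0.92),
        ColumnStats("country_code", False, 0.001),
        ColumnStats("order_note", False, 0.30),
    ]
    for c in cols:
        print(f"{c.name}: {suggest_role(c)}")
```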

2009 ◽  
pp. 615-636
Author(s):  
Maurizio Pighin ◽  
Lucio Ieronutti

The design and configuration of a data warehouse can be difficult tasks, especially in the case of very large databases and in the presence of redundant information. In particular, the choice of which attributes should be treated as dimensions and which as measures is not trivial, and it can heavily influence the effectiveness of the final system. In this article, we propose a methodology aimed at supporting the design process and at deriving information on the overall quality of the final data warehouse. We tested our proposal on three real-world commercial ERP databases.


2010 ◽  
pp. 397-417
Author(s):  
Maurizio Pighin ◽  
Lucio Ieronutti

The design and configuration of a data warehouse can be difficult tasks, especially in the case of very large databases and in the presence of redundant information. In particular, the choice of which attributes should be treated as dimensions and which as measures is not trivial, and it can heavily influence the effectiveness of the final system. In this article, we propose a methodology aimed at supporting the design process and at deriving information on the overall quality of the final data warehouse. We tested our proposal on three real-world commercial ERP databases.


Robotica ◽  
2014 ◽  
Vol 33 (5) ◽  
pp. 1131-1146
Author(s):  
Jimmy A. Rytz ◽  
Lars-Peter Ellekilde ◽  
Dirk Kraft ◽  
Henrik G. Petersen ◽  
Norbert Krüger

SUMMARY: It has become common practice to use simulation to generate large databases of good grasps for grasp planning in robotics research. However, the existence of a generic simulation context that enables the generation of high-quality grasps usable in several different contexts, such as bin-picking or picking objects from a table, has to our knowledge not yet been discussed in the literature. In this paper, we investigate how well the quality of grasps simulated in a commonly used "generic" context transfers to a specific context, both in simulation and in the real world. We generate a large database of grasp hypotheses for several objects and grippers, which we then evaluate in different dynamic simulation contexts, e.g., free floating (no gravity, no obstacles), standing on a table, and lying on a table. We present a comparison of the intersection of the grasp outcome space across the different contexts and quantitatively show that, to generate reliable grasp databases, it is important to use context-specific simulation. We furthermore evaluate how well a state-of-the-art grasp database transfers from two simulated contexts to a real-world context of picking an object from a table, and discuss how to evaluate transferability into non-deterministic real-world contexts.
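As a rough sketch of how the overlap between grasp outcome spaces could be quantified, the snippet below computes a Jaccard-style overlap between the sets of grasps that succeed in two simulation contexts. The metric, grasp IDs, and context names are illustrative assumptions, not the paper's actual evaluation procedure.

```python
# Illustrative sketch, not the paper's actual metric: compare the sets of
# grasp hypotheses that succeed in different simulation contexts and report
# how much the outcome spaces overlap. Grasp IDs and contexts are hypothetical.
def success_overlap(outcomes_a: dict[int, bool], outcomes_b: dict[int, bool]) -> float:
    """Jaccard overlap between the successful-grasp sets of two contexts."""
    succ_a = {g for g, ok in outcomes_a.items() if ok}
    succ_b = {g for g, ok in outcomes_b.items() if ok}
    union = succ_a | succ_b
    return len(succ_a & succ_b) / len(union) if union else 1.0

if __name__ == "__main__":
    free_floating = {1: True, 2: True, 3: False, 4: True}
    on_table      = {1: True, 2: False, 3: False, 4: True}
    print(f"overlap(free floating, on table) = "
          f"{success_overlap(free_floating, on_table):.2f}")
```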


2015 ◽  
Vol 2015 ◽  
pp. 1-14 ◽  
Author(s):  
Jaemun Sim ◽  
Jonathan Sangyun Lee ◽  
Ohbyung Kwon

In a ubiquitous environment, high-accuracy data analysis is essential because it affects real-world decision-making. However, in the real world, user-related data from information systems are often missing due to users’ concerns about privacy or the lack of an obligation to provide complete data. This data incompleteness can impair the accuracy of data analysis using classification algorithms, which can degrade the value of the data. Many studies have attempted to overcome these data incompleteness issues and to improve the quality of data analysis using classification algorithms. The performance of classification algorithms may be affected by the characteristics and patterns of the missing data, such as the ratio of missing data to complete data. We perform a concrete causal analysis of how various factors, including the characteristics of the missing values, the datasets, and the imputation methods, affect the performance of classification algorithms. We also propose imputation and classification algorithms appropriate to different datasets and circumstances.
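A minimal sketch of this kind of experiment is shown below, assuming a scikit-learn setup with a public dataset, random masking of values at several missing ratios, and two common imputers (mean and k-NN). It is illustrative only and does not reproduce the study's datasets, imputation methods, or factor analysis.

```python
# Minimal sketch: inject missing values at different ratios, impute them with
# different methods, and compare classifier accuracy. Dataset, imputers, and
# classifier are assumptions chosen for illustration.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

for missing_ratio in (0.1, 0.3, 0.5):
    X_missing = X.copy()
    mask = rng.random(X.shape) < missing_ratio   # missing completely at random
    X_missing[mask] = np.nan
    for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                          ("knn", KNNImputer(n_neighbors=5))]:
        model = make_pipeline(imputer, RandomForestClassifier(random_state=0))
        acc = cross_val_score(model, X_missing, y, cv=5).mean()
        print(f"missing={missing_ratio:.0%} imputer={name}: accuracy={acc:.3f}")
```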


1998 ◽  
Vol 07 (02n03) ◽  
pp. 215-247 ◽  
Author(s):  
MATTEO GOLFARELLI ◽  
DARIO MAIO ◽  
STEFANO RIZZI

Data warehousing systems enable enterprise managers to acquire and integrate information from heterogeneous sources and to query very large databases efficiently. Building a data warehouse requires adopting design and implementation techniques completely different from those underlying operational information systems. Though most scientific literature on the design of data warehouses concerns their logical and physical models, an accurate conceptual design is the necessary foundation for building a DW which is well-documented and fully satisfies requirements. In this paper we formalize a graphical conceptual model for data warehouses, called the Dimensional Fact Model, and propose a semi-automated methodology to build it from the pre-existing (conceptual or logical) schemes describing the enterprise relational database. The representation of reality built using our conceptual model consists of a set of fact schemes whose basic elements are facts, measures, attributes, dimensions and hierarchies; other features which may be represented on fact schemes are the additivity of fact attributes along dimensions, the optionality of dimension attributes and the existence of non-dimension attributes. Compatible fact schemes may be overlapped in order to relate and compare data for drill-across queries. Fact schemes should be integrated with information about the conjectured workload, to be used as the input of the logical and physical design phases; to this end, we propose a simple language to denote data warehouse queries in terms of sets of fact instances.
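As a small illustration of the ingredients of a fact scheme (fact, measures, dimensions, hierarchies), the sketch below uses a hypothetical SALE fact. It is loosely inspired by the Dimensional Fact Model but does not reproduce the paper's notation or methodology.

```python
# Illustrative sketch only: a minimal in-memory representation of a fact
# scheme with measures, dimensions and hierarchies. The SALE example and
# attribute names are assumptions.
from dataclasses import dataclass, field

@dataclass
class Dimension:
    name: str
    hierarchy: list[str] = field(default_factory=list)  # finer -> coarser attributes

@dataclass
class FactScheme:
    fact: str
    measures: list[str]
    dimensions: list[Dimension]

sale = FactScheme(
    fact="SALE",
    measures=["quantity", "revenue"],
    dimensions=[
        Dimension("date", ["day", "month", "year"]),
        Dimension("product", ["product", "category"]),
        Dimension("store", ["store", "city", "country"]),
    ],
)

for d in sale.dimensions:
    print(f"{sale.fact} rolls up along {d.name}: {' -> '.join(d.hierarchy)}")
```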


2012 ◽  
Vol 6 (3) ◽  
pp. 255 ◽  
Author(s):  
Anjana Gosain ◽  
Sangeeta Sabharwal ◽  
Sushama Nagpal

Testing is essential in data warehouse systems for decision making because the accuracy, validity and correctness of the data depend on it. Considering the characteristics and complexity of data warehouses, in this paper we show the scope of automated testing in assuring the best data warehouse solutions. First, we developed a data set generator for creating synthetic but near-real data; then, in the synthesized data, anomalies are classified with the help of a hand-coded Extraction, Transformation and Loading (ETL) routine. To assure the quality of data for a data warehouse, and to illustrate how important Extraction, Transformation and Loading is, several important test cases were identified. After that, to ensure data quality, automated testing procedures were embedded in the hand-coded ETL routine. A statistical analysis revealed a substantial improvement in data quality when the automated testing procedures were used, reinforcing the point that automated testing gives promising results for data warehouse quality. For effective and easy maintenance of distributed data, a novel architecture was proposed. Although the desired results of this research were achieved and the objectives are promising, the results still need to be validated in a real-life environment, since this research was done in a simulated environment, which may not always reflect real-life behavior. Hence, the overall potential of the proposed architecture cannot be assessed until it is deployed to manage real, globally distributed data.
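For illustration, the sketch below shows a few of the kinds of automated checks an ETL test harness might embed: row-count reconciliation, mandatory-column null checks, and a simple referential-integrity check. The table and column names are hypothetical and are not taken from the paper.

```python
# Illustrative sketch of automated ETL data quality checks. Table and column
# names are hypothetical examples.
import pandas as pd

def check_row_counts(source: pd.DataFrame, target: pd.DataFrame) -> bool:
    """All source rows should arrive in the target (or be logged as rejects)."""
    return len(target) == len(source)

def check_not_null(df: pd.DataFrame, columns: list[str]) -> dict[str, int]:
    """Return the number of NULLs per mandatory column."""
    return {c: int(df[c].isna().sum()) for c in columns}

def check_referential_integrity(fact: pd.DataFrame, dim: pd.DataFrame,
                                fk: str, pk: str) -> int:
    """Count fact rows whose foreign key has no match in the dimension."""
    return int((~fact[fk].isin(dim[pk])).sum())

if __name__ == "__main__":
    dim_customer = pd.DataFrame({"customer_id": [1, 2, 3]})
    fact_sales = pd.DataFrame({"customer_id": [1, 2, 9], "amount": [10.0, None, 5.0]})
    print("row counts ok:", check_row_counts(fact_sales, fact_sales))
    print("nulls:", check_not_null(fact_sales, ["amount"]))
    print("orphan fact rows:", check_referential_integrity(
        fact_sales, dim_customer, "customer_id", "customer_id"))
```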


2017 ◽  
Author(s):  
Christoph Endrullat

Introduction: Second-generation sequencing, better known as next-generation sequencing (NGS), represents a cutting-edge technology in the life sciences and the current foundation for unravelling nucleotide sequences. Since the advent of the first platforms in 2005, the number of different types of NGS platforms has increased over the last 10 years in the same manner as the variety of possible applications. Higher throughput, lower cost and better data quality have been the incentive for a range of enterprises developing new NGS devices, whereas economic issues and competitive pressure, driven by the expensive workflows of obsolete systems and the decreasing cost of market-leading platforms, simultaneously led to the accelerated disappearance of several companies. Due to this fast development, NGS is currently characterized by a lack of standard operating procedures, quality management/quality assurance specifications, proficiency testing systems and, even more so, approved standards, along with high costs and uncertainty about data quality. On the one hand, appropriate standardization approaches have already been undertaken by different initiatives and projects in the form of accreditation checklists, technical notes and guidelines for the validation of NGS workflows. On the other hand, these approaches are located exclusively in the US, owing to the origins of NGS overseas; there is therefore an obvious lack of European-based standardization initiatives. An additional problem is the validity of promising standards across different NGS applications. Due to the high demands and regulations in specific areas such as clinical diagnostics, the standards established there will not be applicable or reasonable in other applications. These points emphasize the importance of standardization in NGS, mainly addressing the laboratory workflows, which are the prerequisite and foundation for sufficient quality of downstream results.

Methods: This work was based on a platform-dependent and platform-independent systematic literature review as well as on personal communications with, among others, Illumina, Inc., ISO/TC 276 and DIN NA 057-06-02 AA 'Biotechnology'.

Results: Prior to the formulation of specific standard proposals and the collection of current de facto standards, the problems of standardization in NGS itself were identified and interpreted. To this end, a variety of standardization approaches and projects from organizations, societies and companies were reviewed.

Conclusions: A distinct number of NGS standardization efforts already exist; however, the majority of approaches target the standardization of the bioinformatics processing pipeline in the context of “Big Data”. An essential prerequisite is therefore the simplification and standardization of the wet-laboratory workflows, because the respective steps directly affect the final data quality, and thus there is a demand to formulate experimental procedures that ensure a sufficient quality of the final data output.


2010 ◽  
Vol 7 (3) ◽  
Author(s):  
Giorgio Ghisalberti ◽  
Marco Masseroli ◽  
Luca Tettamanti

Summary: Numerous biomolecular data are available, but they are scattered across many databases and only some of them are curated by experts. Most available data are computationally derived and include errors and inconsistencies. Effective use of the available data to derive new knowledge hence requires data integration and quality improvement. Many approaches for data integration have been proposed. Data warehousing seems to be the most adequate when comprehensive analysis of the integrated data is required, which also makes it the most suitable for implementing comprehensive quality controls on the integrated data. We previously developed GFINDer (http://www.bioinformatics.polimi.it/GFINDer/), a web system that supports scientists in effectively using the available information. It allows comprehensive statistical analysis and mining of functional and phenotypic annotations of gene lists, such as those identified by high-throughput biomolecular experiments. The GFINDer backend is composed of a multi-organism genomic and proteomic data warehouse (GPDW). Within the GPDW, several controlled terminologies and ontologies, which describe gene and gene product related biomolecular processes, functions and phenotypes, are imported and integrated, together with their associations with genes and proteins of several organisms. To ease keeping the GPDW up to date and to ensure the best possible quality of the data integrated in subsequent updates of the data warehouse, we developed several automatic procedures. Within these procedures, we implemented numerous data quality control techniques to test the integrated data for a variety of possible errors and inconsistencies. Among other features, the implemented controls check data structure and completeness, ontological data consistency, ID format and evolution, unexpected data quantification values, and the consistency of data from single and multiple sources. We used the implemented controls to analyze the quality of data available from several different biological databases and integrated in the GFINDer data warehouse. By doing so, we identified in these data a variety of different types of errors and inconsistencies, which enabled us to ensure a good quality of the data in the GFINDer data warehouse. We reported all identified data errors and inconsistencies to the curators of the original databases from which the data were retrieved, who largely corrected them in subsequent updates of the original databases. This contributed to improving the quality of the data available in the original databases to the whole scientific community.
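As a rough sketch of a few of the control categories mentioned (ID format, completeness, unexpected quantification values), the snippet below shows minimal checks. The ID pattern and field names are assumptions for illustration and do not reproduce the GPDW procedures.

```python
# Illustrative sketch of a few kinds of data quality controls (ID format,
# completeness, value range). Patterns and field names are hypothetical.
import re

ENTREZ_GENE_ID = re.compile(r"^\d+$")  # assumed pattern: numeric Entrez Gene IDs

def check_id_format(ids: list[str]) -> list[str]:
    """Return IDs that do not match the expected format."""
    return [i for i in ids if not ENTREZ_GENE_ID.match(i)]

def check_completeness(record: dict, required: list[str]) -> list[str]:
    """Return required fields that are missing or empty."""
    return [f for f in required if not record.get(f)]

def check_value_range(value: float, low: float, high: float) -> bool:
    """Flag quantitative values outside the expected range."""
    return low <= value <= high

if __name__ == "__main__":
    print(check_id_format(["7157", "BRCA1?", "672"]))            # ['BRCA1?']
    print(check_completeness({"gene_id": "7157", "symbol": ""},  # ['symbol']
                             ["gene_id", "symbol"]))
    print(check_value_range(-3.0, 0.0, 1.0))                     # False
```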


Author(s):  
Piotr Bednarczuk

Very large databases, such as data warehouses, slow down over time. This is usually due to a large daily increase in the data in individual tables, counted in millions of records per day. How do we make sure our queries do not slow down over time? Table partitioning comes in handy and, when used correctly, can ensure the smooth operation of very large databases with billions of records, even after several years.
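As a brief illustration of the technique, the sketch below generates monthly range partitions for a hypothetical sales fact table using PostgreSQL-style declarative partitioning DDL. The table and column names are assumptions, not taken from the text.

```python
# Illustrative sketch: generate monthly range-partition DDL for a large fact
# table (PostgreSQL declarative partitioning syntax). Names are hypothetical.
from datetime import date

def monthly_partition_ddl(table: str, year: int, month: int) -> str:
    """Build DDL for one monthly range partition of a partitioned parent table."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{table}_{start:%Y_%m}"
    return (f"CREATE TABLE {name} PARTITION OF {table}\n"
            f"    FOR VALUES FROM ('{start}') TO ('{end}');")

if __name__ == "__main__":
    # Parent table declared once with RANGE partitioning on the date column.
    print("CREATE TABLE sales (sale_date date NOT NULL, amount numeric)\n"
          "    PARTITION BY RANGE (sale_date);")
    for m in (1, 2, 3):
        print(monthly_partition_ddl("sales", 2024, m))
```

Queries that filter on the partitioning column then only touch the relevant partitions, and old partitions can be detached or dropped without rewriting the whole table.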

