Data Warehousing Requirements Collection and Definition

2010 ◽ 
Vol 1 (3) ◽ 
pp. 66-76 ◽ 
Author(s):  
Nenad Jukic ◽  
Miguel Velasco

Defining data warehouse requirements is widely recognized as one of the most important steps in the larger data warehouse system development process. This paper examines the potential risks and pitfalls within the data warehouse requirements collection and definition process. A real scenario of a large-scale data warehouse implementation is presented, detailing a project that ultimately failed because of an inadequate requirements collection and definition process. The case underscores and illustrates the impact of this process on data warehouse implementation, and it is analyzed within the context of existing approaches, methodologies, and best practices for preventing and avoiding typical data warehouse requirements errors and oversights.



Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

Abstract This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers: it searches for the tree structure and tests simultaneously, and thus in many situations improves the prediction accuracy and size of the resulting classifiers. However, as a population-based and iterative approach, it can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of GPU memory and computing resources. The search for the tree structure and tests is performed on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets; in both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimensionality) on the convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.
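
The abstract includes no code, but the CPU/GPU split it describes can be illustrated with a minimal sketch. The Python below is only a stand-in for the paper's CUDA kernels (the function names and toy tree representation are hypothetical): it mimics the data-parallel decomposition by partitioning the dataset into chunks, letting each worker score its own chunk in the role of one GPU, and reducing the partial error counts on the CPU into a single fitness value that also penalizes tree size.

```python
import numpy as np
from multiprocessing import Pool

def chunk_errors(args):
    """Score one data chunk against a candidate tree.

    Plays the role of a per-GPU fitness kernel: each worker evaluates its
    own slice of the data. A tree is a nested dict, either an internal node
    {'feat': i, 'thr': t, 'left': ..., 'right': ...} or a leaf {'label': c}.
    Returns the chunk's misclassification count."""
    tree, X, y = args
    errors = 0
    for xi, yi in zip(X, y):
        node = tree
        while 'label' not in node:
            node = node['left'] if xi[node['feat']] <= node['thr'] else node['right']
        errors += int(node['label'] != yi)
    return errors

def fitness(tree, n_nodes, chunks, pool, n_total, alpha=0.01):
    """Data-parallel fitness: map chunks to workers, reduce the partial
    error counts, and add a size penalty (global induction optimizes
    structure and accuracy together). Lower is better."""
    partial = pool.map(chunk_errors, [(tree, X, y) for X, y in chunks])
    return sum(partial) / n_total + alpha * n_nodes

if __name__ == '__main__':
    rng = np.random.default_rng(0)
    X = rng.random((10_000, 4))
    y = (X[:, 0] > 0.5).astype(int)
    chunks = [(X[i::4], y[i::4]) for i in range(4)]   # one slice per "device"
    tree = {'feat': 0, 'thr': 0.5,
            'left': {'label': 0}, 'right': {'label': 1}}  # 3 nodes
    with Pool(4) as pool:
        print(fitness(tree, 3, chunks, pool, len(y)))  # ~0.03: size penalty only
```

In the paper's actual setting the data slices reside in GPU memory and the evolutionary loop mutates and recombines whole trees between fitness evaluations; the sketch only shows where the map/reduce boundary sits.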


Author(s):  
Anne H Klein ◽  
Kaylene R Ballard ◽  
Kenneth B Storey ◽  
Cherie A Motti ◽  
Min Zhao ◽  
...  

Abstract Gastropods are the largest and most diverse class of molluscs and include species that are well studied within the areas of taxonomy, aquaculture, biomineralization, ecology, microbiome and health. Gastropod research has been expanding since the mid-2000s, largely due to large-scale data integration from next-generation sequencing and mass spectrometry, through which transcripts, proteins and metabolites can be systematically explored. Correspondingly, this wealth of data has added a great deal of complexity to data organization, visualization and interpretation. Here, we review recent advances in gastropod omics (‘gastropodomics’) research drawn from hundreds of publications and online genomics databases. By summarizing the currently available public data, we present insights for the design of useful data-integration tools and strategies for future comparative omics studies. Additionally, we discuss the future of omics applications in aquaculture, natural pharmaceutical biodiscovery and pest management, as well as in monitoring the impact of environmental stressors.


2013 ◽  
Vol 3 (4) ◽  
pp. 38-50 ◽  
Author(s):  
Yashvardhan Sharma ◽  
Saurabh Verma ◽  
Sumit Kumar ◽  
Shivam U.

To achieve high reliability and scalability, most large-scale data warehouse systems have adopted a cluster-based architecture. In this context, MapReduce has emerged as a promising architecture for large-scale data warehousing and data analytics on commodity clusters. The MapReduce framework offers several attractive features, such as high fault tolerance, scalability, and the ability to use a wide variety of hardware, from low-end to high-end. However, these benefits come at a substantial performance cost. In this paper, we propose the design of a novel cluster-based data warehouse system, Daenyrys, for data processing on Hadoop, an open-source implementation of the MapReduce framework under the Apache umbrella. Daenyrys is a data management system capable of deciding on the optimal partitioning scheme for Hadoop's distributed file system (DFS). The optimal partitioning scheme improves the performance of the complete framework, and the choice of partitioning is query-context dependent. In Daenyrys, columns are formed into optimized groups that provide the basis for partitioning tables vertically. Daenyrys includes an algorithm that monitors the context of current queries and, based on these observations, re-partitions the DFS for better performance and resource utilization. On top of the DFS, the proposed system supports Hive, a MapReduce-based SQL-like query engine.
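
The abstract does not specify Daenyrys's actual grouping algorithm, so the Python below is a hypothetical greedy affinity heuristic, not the authors' method: it counts how often column pairs are accessed together in an observed query workload and merges the most frequently co-accessed columns into bounded-size groups, which would then map to vertical partitions of a table.

```python
from collections import Counter
from itertools import combinations

def column_groups(workload, max_group_size=3):
    """Derive vertical-partitioning column groups from a query workload:
    columns frequently accessed together land in the same group, so a
    typical query touches as few partitions as possible.

    `workload` is a list of sets, each the columns referenced by one query."""
    pair_counts = Counter()
    for cols in workload:
        pair_counts.update(combinations(sorted(cols), 2))

    group_of = {}  # column -> its group (a set object shared by members)
    for (a, b), _ in pair_counts.most_common():
        ga, gb = group_of.get(a), group_of.get(b)
        if ga is None and gb is None:
            g = {a, b}
            group_of[a] = group_of[b] = g
        elif ga is not None and gb is None and len(ga) < max_group_size:
            ga.add(b); group_of[b] = ga
        elif gb is not None and ga is None and len(gb) < max_group_size:
            gb.add(a); group_of[a] = gb
        # if both are already grouped (or a group is full), leave them be

    # Columns never co-accessed with others become singleton groups.
    for c in set().union(*workload) - group_of.keys():
        group_of[c] = {c}
    # Deduplicate the shared set objects before returning.
    return [sorted(g) for g in {id(g): g for g in group_of.values()}.values()]

# Example: a small query log over a sales table.
workload = [
    {'order_id', 'date', 'amount'},
    {'order_id', 'amount'},
    {'customer', 'region'},
    {'customer', 'region', 'date'},
]
print(column_groups(workload))
# [['amount', 'date', 'order_id'], ['customer', 'region']]
```

A re-partitioning monitor in the spirit of the paper would rerun this grouping as the query mix drifts and trigger a DFS re-partition only when the resulting groups change.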


2021 ◽  
Author(s):  
Pablo Carvalho ◽  
Lúcia Maria De A. Drummond ◽  
Cristiana Bentes

Heterogeneous systems employing CPUs and GPUs are becoming increasingly popular in large-scale data centers and cloud environments. In these platforms, sharing a GPU across different applications is an important feature for improving hardware utilization and system throughput. However, scenarios where GPUs are competitively shared raise some challenges. The decision on the simultaneous execution of different kernels is made by the hardware and depends on the kernels' resource requirements, and it is very difficult to understand all the hardware variables involved in these decisions well enough to describe a formal allocation method. In this work, we studied the impact that kernel resource requirements have on concurrent execution and used machine learning (ML) techniques to infer the interference caused by concurrent execution and to classify the slowdown that results from it. The ML techniques were evaluated on the GPU benchmark suites Rodinia, Parboil, and SHOC. Our results showed that, among the features selected in the analysis, the number of blocks per grid, the number of threads per block, and the number of registers are the resource-consumption features that most affect the performance of concurrent execution.
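
As an illustration of the classification step, the sketch below (synthetic data; not the authors' dataset, features list, or model choice) trains a classifier to predict a slowdown class for a pair of co-scheduled kernels from exactly the features the study highlights: blocks per grid, threads per block, and registers. The labels here are derived from a made-up register-pressure proxy, standing in for slowdowns measured on real concurrent runs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 500
# One row per pair of co-scheduled kernels (A, B); the features follow the
# abstract's findings: blocks per grid, threads per block, registers.
X = np.column_stack([
    rng.integers(1, 1024, n),   # A: blocks per grid
    rng.integers(32, 1025, n),  # A: threads per block
    rng.integers(8, 256, n),    # A: registers per thread
    rng.integers(1, 1024, n),   # B: blocks per grid
    rng.integers(32, 1025, n),  # B: threads per block
    rng.integers(8, 256, n),    # B: registers per thread
])
# Synthetic slowdown class (0 = mild, 1 = moderate, 2 = severe) from a
# register-pressure proxy; in the real study this label comes from
# measured concurrent executions.
pressure = X[:, 1] * X[:, 2] + X[:, 4] * X[:, 5]
y = np.digitize(pressure, np.quantile(pressure, [1 / 3, 2 / 3]))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print('cv accuracy:', cross_val_score(clf, X, y, cv=5).mean().round(3))

clf.fit(X, y)
feats = ['A_blocks', 'A_threads', 'A_regs', 'B_blocks', 'B_threads', 'B_regs']
# Feature importances indicate which resource requirements drive interference.
print(dict(zip(feats, clf.feature_importances_.round(3))))
```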

