scholarly journals Multi-GPU approach to global induction of classification trees for large-scale data mining

Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

AbstractThis paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to the top-down inducers. It searches for the tree structure and tests simultaneously and thus gives improvements in the prediction and size of resulting classifiers in many situations. However, it is the population-based and iterative approach that can be too computationally demanding to apply for big data mining directly. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask the question whether the global approach can truly compete with the greedy systems for large-scale data. For this purpose, we propose a novel multi-GPU approach. It incorporates the knowledge of global DT induction and evolutionary algorithm parallelization together with efficient utilization of memory and computing GPU’s resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. Data-parallel decomposition strategy and CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed what suggests that data size boundaries for evolutionary DT mining are fading.

2018 ◽  
Vol 3 (1) ◽  
pp. 1-18
Author(s):  
Kislaya Kunjan ◽  
Huanmei Wu ◽  
Tammy R. Toscos ◽  
Bradley N. Doebbeling

Author(s):  
Nenad Jukic ◽  
Miguel Velasco

Defining data warehouse requirements is widely recognized as one of the most important steps in the larger data warehouse system development process. This paper examines the potential risks and pitfalls within the data warehouse requirement collection and definition process. A real scenario of a large-scale data warehouse implementation is given, and details of this project, which ultimately failed due to inadequate requirement collection and definition process, are described. The presented case underscores and illustrates the impact of the requirement collection and definition process on the data warehouse implementation, while the case is analyzed within the context of the existing approaches, methodologies, and best practices for prevention and avoidance of typical data warehouse requirement errors and oversights.


2010 ◽  
Vol 1 (3) ◽  
pp. 66-76
Author(s):  
Nenad Jukic ◽  
Miguel Velasco

Defining data warehouse requirements is widely recognized as one of the most important steps in the larger data warehouse system development process. This paper examines the potential risks and pitfalls within the data warehouse requirement collection and definition process. A real scenario of a large-scale data warehouse implementation is given, and details of this project, which ultimately failed due to inadequate requirement collection and definition process, are described. The presented case underscores and illustrates the impact of the requirement collection and definition process on the data warehouse implementation, while the case is analyzed within the context of the existing approaches, methodologies, and best practices for prevention and avoidance of typical data warehouse requirement errors and oversights.


Author(s):  
Anne H Klein ◽  
Kaylene R Ballard ◽  
Kenneth B Storey ◽  
Cherie A Motti ◽  
Min Zhao ◽  
...  

Abstract Gastropods are the largest and most diverse class of mollusc and include species that are well studied within the areas of taxonomy, aquaculture, biomineralization, ecology, microbiome and health. Gastropod research has been expanding since the mid-2000s, largely due to large-scale data integration from next-generation sequencing and mass spectrometry in which transcripts, proteins and metabolites can be readily explored systematically. Correspondingly, the huge data added a great deal of complexity for data organization, visualization and interpretation. Here, we reviewed the recent advances involving gastropod omics (‘gastropodomics’) research from hundreds of publications and online genomics databases. By summarizing the current publicly available data, we present an insight for the design of useful data integrating tools and strategies for comparative omics studies in the future. Additionally, we discuss the future of omics applications in aquaculture, natural pharmaceutical biodiscovery and pest management, as well as to monitor the impact of environmental stressors.


2004 ◽  
Vol 19 (1) ◽  
pp. 147-158 ◽  
Author(s):  
Matthew Ward ◽  
Wei Peng ◽  
Xiaoning Wang

Sign in / Sign up

Export Citation Format

Share Document