Multi-omics investigations within the Phylum Mollusca, Class Gastropoda: from ecological application to breakthrough phylogenomic studies

Author(s):  
Anne H Klein ◽  
Kaylene R Ballard ◽  
Kenneth B Storey ◽  
Cherie A Motti ◽  
Min Zhao ◽  
...  

Abstract: Gastropods are the largest and most diverse class of molluscs and include species that are well studied in the areas of taxonomy, aquaculture, biomineralization, ecology, microbiomes and health. Gastropod research has been expanding since the mid-2000s, largely because large-scale data integration from next-generation sequencing and mass spectrometry allows transcripts, proteins and metabolites to be explored systematically. Correspondingly, these large datasets add considerable complexity to data organization, visualization and interpretation. Here, we review recent advances in gastropod omics (‘gastropodomics’) research drawn from hundreds of publications and online genomics databases. By summarizing the currently available public data, we provide insights for the design of useful data-integration tools and strategies for future comparative omics studies. Additionally, we discuss the future of omics applications in aquaculture, natural pharmaceutical biodiscovery and pest management, as well as in monitoring the impact of environmental stressors.

Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

Abstract: This paper concerns the evolutionary induction of decision trees (DTs) for large-scale data. Such a global approach is one of the alternatives to top-down inducers: it searches for the tree structure and the tests simultaneously and thus often improves the prediction accuracy and size of the resulting classifiers. However, this population-based, iterative approach can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of GPU memory and computing resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets, and in both cases the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed, which suggests that data-size boundaries for evolutionary DT mining are fading.
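
To make the data-parallel decomposition concrete, the sketch below (illustrative only, not the authors' CUDA code) scores one candidate tree by splitting the training instances across worker processes and summing the partial results; the workers stand in for the per-GPU fitness kernels described in the abstract, and the tree encoding and function names are assumptions.

```python
# Minimal sketch: data-parallel fitness evaluation for evolutionary DT induction.
# The evolutionary search over tree structures would run on the CPU; each data
# chunk is scored by a separate worker, standing in for a per-GPU fitness kernel.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def predict_one(tree, x):
    """Walk a candidate tree (nested dicts) down to a leaf label."""
    node = tree
    while "label" not in node:                     # internal node
        node = node["left"] if x[node["feat"]] <= node["thr"] else node["right"]
    return node["label"]

def chunk_fitness(args):
    """Partial fitness on one data chunk: (#correct, #instances)."""
    tree, X_chunk, y_chunk = args
    correct = sum(predict_one(tree, x) == y for x, y in zip(X_chunk, y_chunk))
    return correct, len(y_chunk)

def fitness(tree, X, y, n_workers=4):
    """Accuracy of one candidate tree, computed with a data-parallel decomposition."""
    chunks = [(tree, Xc, yc) for Xc, yc in
              zip(np.array_split(X, n_workers), np.array_split(y, n_workers))]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(chunk_fitness, chunks))
    correct = sum(c for c, _ in parts)
    total = sum(n for _, n in parts)
    return correct / total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((10_000, 3))
    y = (X[:, 0] > 0.5).astype(int)
    # A hand-written candidate tree standing in for one individual in the population.
    tree = {"feat": 0, "thr": 0.5, "left": {"label": 0}, "right": {"label": 1}}
    print(f"fitness = {fitness(tree, X, y):.3f}")
```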


2017 ◽  
Vol 8 (2) ◽  
pp. 30-43
Author(s):  
Mrutyunjaya Panda

Big data, due to its complicated and diverse nature, poses many challenges for extracting meaningful observations. This calls for smart and efficient algorithms that can deal with the computational complexity and memory constraints arising from their iterative behavior. The issue may be addressed with parallel computing techniques, where a single machine or multiple machines perform the work simultaneously, dividing the problem into subproblems and assigning private memory to each subproblem. Clustering analysis has recently proven useful for handling such huge data. Although many investigations into big data analysis are ongoing, here Canopy and K-Means++ clustering are used to process large-scale data in a shorter amount of time without memory constraints. To assess the suitability of the approach, several datasets are considered, ranging from small to very large and covering diverse fields of application. The experimental results indicate that the proposed approach is fast and accurate.
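
The two-stage idea, a cheap Canopy pass to obtain rough centres followed by k-means with k-means++ seeding, can be sketched as follows. This is an illustrative reconstruction on synthetic data, not the article's implementation; the thresholds and dataset are assumptions.

```python
# Minimal sketch: Canopy pre-clustering to suggest rough centres, then k-means
# with k-means++ initialization (scikit-learn) on the full data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def canopy_centres(X, t2=3.0):
    """Pick rough canopy centres with a single cheap distance threshold (T2).
    The full Canopy algorithm also uses a looser threshold T1 to form
    overlapping canopies; only centre selection is needed here to seed k."""
    remaining = list(range(len(X)))
    centres = []
    while remaining:
        i = remaining.pop(0)                 # an arbitrary remaining point becomes a centre
        centres.append(X[i])
        d = np.linalg.norm(X[remaining] - X[i], axis=1)
        # points within the tight threshold never start a new canopy
        remaining = [j for j, dist in zip(remaining, d) if dist > t2]
    return np.array(centres)

# Synthetic stand-in for a large dataset.
X, _ = make_blobs(n_samples=20_000, centers=8, cluster_std=1.0, random_state=0)

centres = canopy_centres(X, t2=3.0)
k = len(centres)                             # canopies give a rough cluster count
km = KMeans(n_clusters=k, init="k-means++", n_init=3, random_state=0).fit(X)
print(f"{k} canopies -> k-means inertia {km.inertia_:.1f}")
```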


2010 ◽  
Vol 1 (3) ◽  
pp. 66-76
Author(s):  
Nenad Jukic ◽  
Miguel Velasco

Defining data warehouse requirements is widely recognized as one of the most important steps in the larger data warehouse system development process. This paper examines the potential risks and pitfalls within the data warehouse requirement collection and definition process. A real scenario of a large-scale data warehouse implementation is given, and details of this project, which ultimately failed due to an inadequate requirement collection and definition process, are described. The presented case underscores and illustrates the impact of the requirement collection and definition process on data warehouse implementation, and the case is analyzed within the context of existing approaches, methodologies, and best practices for preventing and avoiding typical data warehouse requirement errors and oversights.


Information ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 148
Author(s):  
Anbang Yang ◽  
Jiangbo Qian ◽  
Huahui Chen ◽  
Yihong Dong

With the rapid development of modern society, the volume of generated data has increased exponentially, and finding the required data in this huge data pool is an urgent problem. Hashing technology is widely used in similarity searches over large-scale data. Among these techniques, ranking-based hashing algorithms have been widely studied owing to the accuracy and speed of their search results. At present, most ranking-based hashing algorithms construct loss functions by comparing the rank consistency of data in Euclidean and Hamming spaces. However, most of them have high time complexity and long training times, meaning they cannot meet practical requirements. To address these problems, this paper introduces the distributed Spark framework and implements a ranking-based hashing algorithm in a parallel environment on multiple machines. The experimental results show that Spark-RLSH (Ranking Listwise Supervision Hashing) can greatly reduce the training time and improve training efficiency compared with other ranking-based hashing algorithms.
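
As a rough illustration of the distribution pattern described above (not the Spark-RLSH implementation itself), the following sketch evaluates a triplet-based rank-consistency loss for one candidate hash projection across Spark partitions; the linear projection, the loss form, and all names are assumptions.

```python
# Minimal sketch: rank-consistency violations between Euclidean and Hamming
# rankings are counted in parallel across Spark partitions, so each candidate
# hash projection can be scored on large data without one machine holding it all.
import numpy as np
from pyspark.sql import SparkSession

def hamming(a, b):
    return int(np.sum(a != b))

def triplet_loss(triplet, W):
    """1 if the Hamming ranking contradicts the Euclidean ranking, else 0."""
    q, near, far = (np.asarray(v) for v in triplet)   # Euclidean: near closer to q than far
    code = lambda x: (x @ W > 0)                       # binary hash codes
    return 1 if hamming(code(q), code(near)) >= hamming(code(q), code(far)) else 0

if __name__ == "__main__":
    spark = SparkSession.builder.appName("rank-hash-sketch").getOrCreate()
    sc = spark.sparkContext

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 16))
    W = rng.normal(size=(16, 8))                       # one candidate projection (8-bit codes)

    # Build Euclidean ranking triplets (query, nearer point, farther point).
    triplets = []
    for _ in range(5000):
        q, a, b = X[rng.choice(len(X), 3, replace=False)]
        near, far = (a, b) if np.linalg.norm(q - a) < np.linalg.norm(q - b) else (b, a)
        triplets.append((q.tolist(), near.tolist(), far.tolist()))

    # Distribute the loss evaluation: each partition scores its share of triplets.
    Wb = sc.broadcast(W)
    violations = (sc.parallelize(triplets, numSlices=8)
                    .map(lambda t: triplet_loss(t, Wb.value))
                    .reduce(lambda a, b: a + b))
    print(f"rank-consistency violations: {violations} / {len(triplets)}")
    spark.stop()
```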


Web Services ◽  
2019 ◽  
pp. 788-802
Author(s):  
Mrutyunjaya Panda

Big data, due to its complicated and diverse nature, poses many challenges for extracting meaningful observations. This calls for smart and efficient algorithms that can deal with the computational complexity and memory constraints arising from their iterative behavior. The issue may be addressed with parallel computing techniques, where a single machine or multiple machines perform the work simultaneously, dividing the problem into subproblems and assigning private memory to each subproblem. Clustering analysis has recently proven useful for handling such huge data. Although many investigations into big data analysis are ongoing, here Canopy and K-Means++ clustering are used to process large-scale data in a shorter amount of time without memory constraints. To assess the suitability of the approach, several datasets are considered, ranging from small to very large and covering diverse fields of application. The experimental results indicate that the proposed approach is fast and accurate.


2021 ◽  
Author(s):  
Pablo Carvalho ◽  
Lúcia Maria De A. Drummond ◽  
Cristiana Bentes

Heterogeneous systems employing CPUs and GPUs are becoming increasingly popular in large-scale data centers and cloud environments. In these platforms, sharing a GPU across different applications is an important feature for improving hardware utilization and system throughput. However, scenarios where GPUs are competitively shared raise some challenges. The decision on the simultaneous execution of different kernels is made by the hardware and depends on the kernels' resource requirements. Moreover, it is very difficult to understand all the hardware variables involved in the simultaneous-execution decisions well enough to describe a formal allocation method. In this work, we studied the impact that kernel resource requirements have on concurrent execution and used machine learning (ML) techniques to infer the interference caused by concurrent execution and to classify the slowdown that results from this interference. The ML techniques were analyzed over the GPU benchmark suites Rodinia, Parboil and SHOC. Our results showed that, among the features selected in the analysis, the number of blocks per grid, the number of threads per block, and the number of registers are the resource-consumption features that most affect the performance of concurrent execution.
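
The classification step can be illustrated with the sketch below, which trains a random forest on the resource features the abstract highlights; the data and labelling rule are synthetic stand-ins for the measured benchmark pairs, not the paper's results, and the shared-memory feature is an added illustrative assumption.

```python
# Minimal sketch: classify the slowdown of concurrently executed GPU kernels
# from their resource requirements (blocks per grid, threads per block,
# registers per thread, shared memory), using synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 2000

# One row per pair of co-scheduled kernels: combined resource demands of the pair.
blocks_per_grid   = rng.integers(1, 4096, n)
threads_per_block = rng.choice([64, 128, 256, 512, 1024], n)
registers         = rng.integers(8, 255, n)
shared_mem_kb     = rng.integers(0, 48, n)
X = np.column_stack([blocks_per_grid, threads_per_block, registers, shared_mem_kb])

# Toy labelling rule: heavier combined occupancy pressure -> higher slowdown class
# (0 = low, 1 = moderate, 2 = high). A real study would measure this on hardware.
pressure = blocks_per_grid / 4096 + threads_per_block / 1024 + registers / 255
y = np.digitize(pressure, [1.0, 1.8])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```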

