Efficient and Portable Distribution Modeling for Large-Scale Scientific Data Processing with Data-Parallel Primitives

Algorithms ◽  
2021 ◽  
Vol 14 (10) ◽  
pp. 285
Author(s):  
Hao-Yi Yang ◽  
Zhi-Rong Lin ◽  
Ko-Chih Wang

The use of distribution-based data representation to handle large-scale scientific datasets is a promising approach. Distribution-based approaches typically transform a scientific dataset into many distributions, each of which is calculated from a small number of samples. Most of the proposed parallel algorithms focus on efficiently modeling a single distribution from many input samples, but these may not fit the large-scale scientific data processing scenario because they cannot utilize computing resources effectively. Histograms and the Gaussian Mixture Model (GMM) are the most popular distribution representations used to model scientific datasets. Therefore, we propose multi-set histogram and GMM modeling algorithms for the scenario of large-scale scientific data processing. Our algorithms are built on data-parallel primitives to achieve portability across different hardware architectures. We evaluate the performance of the proposed algorithms in detail and demonstrate use cases for scientific data processing.
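The multi-set setting described above means building one histogram per small block of samples rather than one histogram for the whole dataset. The sketch below illustrates that idea in plain Python; `multiset_histograms` and its binning rule are hypothetical names, and the paper's actual algorithms are expressed with data-parallel primitives (e.g. sort and reduce-by-key) rather than loops.

```python
def multiset_histograms(field, block_size, bin_edges):
    """Partition a 1-D field into fixed-size blocks and build one
    histogram per block (a sequential sketch of multi-set histogram
    modeling; a data-parallel version would replace these loops
    with primitives such as sort and reduce-by-key)."""
    def bin_index(v):
        # right-open bins [e_i, e_{i+1}); the last bin is right-closed
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                return i
        return len(bin_edges) - 2 if v == bin_edges[-1] else None

    hists = []
    for start in range(0, len(field), block_size):
        block = field[start:start + block_size]
        counts = [0] * (len(bin_edges) - 1)
        for v in block:
            i = bin_index(v)
            if i is not None:
                counts[i] += 1
        hists.append(counts)
    return hists
```

Each returned list of counts is the distribution representation of one block, so downstream processing can work on the compact histograms instead of the raw samples.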

2018 ◽  
Vol 210 ◽  
pp. 05016
Author(s):  
Mariusz Chmielewski ◽  
Damian Frąszczak ◽  
Dawid Bugajewski

This paper discusses experiences and architectural concepts, developed and tested, aimed at the acquisition and processing of biomedical data in a large-scale system for monitoring elderly patients. Major assumptions of the research included the use of wearable and mobile technologies supporting the maximum number of inertial and biomedical data streams feeding the decision algorithms. Although medical diagnostics and decision algorithms were not the main aim of the research, this preliminary phase was crucial for testing the capabilities of existing off-the-shelf technologies and the functional responsibilities of the system's logic components. The architecture variants contained several schemes for data processing, moving the responsibility for signal feature extraction, data classification and pattern recognition from wearable to mobile devices and up to server facilities. Analysis of transmission and processing delays revealed the pros and cons of each architecture variant and, above all, knowledge about applicability in the medical, military and fitness domains. To evaluate and construct the architecture, a set of alternative technology stacks and quantitative measures was defined. The major architectural characteristics (high availability, scalability, reliability) were defined, imposing asynchronous processing of sensor data, efficient data representation, iterative reporting, event-driven processing, and restricted pulling operations. Sensor data processing persists the original data on handhelds but is mainly aimed at extracting a chosen set of signal features calculated over specific time windows, which vary with the analysed signals and the sensor data acquisition rates. Long-term monitoring of patients also requires mechanisms that probe the patient and, on detecting anomalies or drastic changes in characteristics, tune the data acquisition process.
This paper describes experiences connected with the design of a scalable decision support tool and evaluation techniques for the architectural concepts implemented within the mobile and server software.
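The windowed feature extraction described above can be sketched as follows. The function name and the particular features (mean, RMS, peak-to-peak) are illustrative assumptions; the paper's system selects features and window lengths per signal and per sensor acquisition rate.

```python
import math

def window_features(signal, window, step):
    """Slide a fixed-size window over a sampled signal and compute
    simple per-window features (mean, RMS, peak-to-peak).
    `window` and `step` are in samples; in the described system both
    would vary with the analysed signal and its acquisition rate."""
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        w = signal[start:start + window]
        mean = sum(w) / window
        rms = math.sqrt(sum(x * x for x in w) / window)
        feats.append({"mean": mean, "rms": rms, "p2p": max(w) - min(w)})
    return feats
```

Shipping only these per-window features (rather than raw samples) to the mobile or server tier is one way the architecture variants trade transmission volume against on-device computation.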


2014 ◽  
Vol 513-517 ◽  
pp. 1464-1469 ◽  
Author(s):  
Zhi Kun Chen ◽  
Shu Qiang Yang ◽  
Shuang Tan ◽  
Hui Zhao ◽  
Li He ◽  
...  

With the development of Internet technology and cloud computing, more and more applications are confronted with the challenges of big data. NoSQL databases suit the management of big data because of their high scalability, high availability and high fault tolerance, and they are one of the key technologies for big data management. We improve the performance of massive data processing in NoSQL databases through large-scale parallel data processing and data-local computing, so how to allocate the data becomes a major challenge for NoSQL databases. In this paper we propose a data allocation strategy based on node load, which adjusts the data allocation according to the execution status of the system and keeps the data allocation balanced at a small cost. Finally, we use experiments to verify the effectiveness of the proposed strategy. The experiments show that it improves system performance compared with other allocation strategies.
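A load-based allocation strategy of the kind outlined above can be reduced, in its simplest greedy form, to placing each new partition on the currently least-loaded node. The sketch below assumes a unit load cost per partition; the paper's strategy additionally reacts to the runtime execution status of the system, which is not modeled here.

```python
def assign_partition(node_loads):
    """Greedy load-based placement: pick the least-loaded node for
    the next data partition and update its load. `node_loads` maps
    node name -> current load (a hypothetical unit-cost model)."""
    node = min(node_loads, key=node_loads.get)
    node_loads[node] += 1
    return node
```

Because placement decisions only touch the load table, rebalancing stays cheap, which matches the abstract's claim of keeping allocation balanced at small cost.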


Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

Abstract: This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and tests simultaneously, and thus in many situations improves the prediction and size of the resulting classifiers. However, as a population-based, iterative approach, it can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines the knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of the memory and computing resources of GPUs. The search for the tree structure and tests is performed on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.
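The data-parallel decomposition mentioned above splits the dataset across devices so each computes a partial fitness term that is then reduced. The sketch below shows only that decomposition pattern in plain Python (`partial_errors`, `fitness`, and accuracy-as-fitness are illustrative assumptions; the paper delegates the per-chunk work to CUDA kernels on multiple GPUs).

```python
def partial_errors(tree_predict, chunk):
    """Misclassification count on one data chunk (the part that
    would run on one GPU in the described approach)."""
    return sum(1 for x, y in chunk if tree_predict(x) != y)

def fitness(tree_predict, data, n_workers):
    """Data-parallel fitness evaluation: split the instances into
    n_workers chunks, evaluate partial error counts independently,
    then reduce. Uses accuracy as an illustrative fitness measure."""
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    total_err = sum(partial_errors(tree_predict, c) for c in chunks)
    return 1.0 - total_err / len(data)
```

Because the per-chunk error counts are independent, adding workers (GPUs) shrinks the dominant cost almost proportionally, which is consistent with the near-linear scalability the abstract reports.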


2008 ◽  
Vol 25 (5) ◽  
pp. 287-300 ◽  
Author(s):  
B. Martin ◽  
A. Al‐Shabibi ◽  
S.M. Batraneanu ◽  
Ciobotaru ◽  
G.L. Darlea ◽  
...  
