Efficient and Portable Distribution Modeling for Large-Scale Scientific Data Processing with Data-Parallel Primitives

Algorithms ◽  
2021 ◽  
Vol 14 (10) ◽  
pp. 285
Author(s):  
Hao-Yi Yang ◽  
Zhi-Rong Lin ◽  
Ko-Chih Wang

The use of distribution-based data representation to handle large-scale scientific datasets is a promising approach. Distribution-based approaches typically transform a scientific dataset into many distributions, each of which is calculated from a small number of samples. Most of the proposed parallel algorithms focus on efficiently modeling a single distribution from many input samples, but these may not fit the large-scale scientific data processing scenario because they cannot utilize computing resources effectively. Histograms and the Gaussian Mixture Model (GMM) are the most popular distribution representations used to model scientific datasets. Therefore, we propose multi-set histogram and GMM modeling algorithms for the scenario of large-scale scientific data processing. Our algorithms are built on data-parallel primitives to achieve portability across different hardware architectures. We evaluate the performance of the proposed algorithms in detail and demonstrate use cases for scientific data processing.
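The multi-set setting described above means building one histogram per small block of samples rather than one histogram for the whole dataset. The sketch below illustrates that idea in plain Python; `multiset_histograms` and its binning rule are hypothetical names, and the paper's actual algorithms are expressed with data-parallel primitives (e.g. sort and reduce-by-key) rather than loops.

```python
def multiset_histograms(field, block_size, bin_edges):
    """Partition a 1-D field into fixed-size blocks and build one
    histogram per block (a sequential sketch of multi-set histogram
    modeling; a data-parallel version would replace these loops
    with primitives such as sort and reduce-by-key)."""
    def bin_index(v):
        # right-open bins [e_i, e_{i+1}); the last bin is right-closed
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                return i
        return len(bin_edges) - 2 if v == bin_edges[-1] else None

    hists = []
    for start in range(0, len(field), block_size):
        block = field[start:start + block_size]
        counts = [0] * (len(bin_edges) - 1)
        for v in block:
            i = bin_index(v)
            if i is not None:
                counts[i] += 1
        hists.append(counts)
    return hists
```

Each returned list of counts is the distribution representation of one block, so downstream processing can work on the compact histograms instead of the raw samples.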

2018 ◽  
Vol 210 ◽  
pp. 05016
Author(s):  
Mariusz Chmielewski ◽  
Damian Frąszczak ◽  
Dawid Bugajewski

This paper discusses experiences and architectural concepts, developed and tested, aimed at the acquisition and processing of biomedical data in a large-scale system for monitoring elderly patients. Major assumptions of the research included the use of wearable and mobile technologies supporting the maximum number of inertial and biomedical data streams feeding the decision algorithms. Although medical diagnostics and decision algorithms were not the main aim of the research, this preliminary phase was crucial for testing the capabilities of existing off-the-shelf technologies and the functional responsibilities of the system's logic components. The architecture variants contained several schemes for data processing, moving the responsibility for signal feature extraction, data classification and pattern recognition from wearable to mobile devices and up to server facilities. Analysis of transmission and processing delays revealed the pros and cons of each architecture variant and, above all, knowledge about applicability in the medical, military and fitness domains. To evaluate and construct the architecture, a set of alternative technology stacks and quantitative measures was defined. The major architectural characteristics (high availability, scalability, reliability) were defined, imposing asynchronous processing of sensor data, efficient data representation, iterative reporting, event-driven processing, and restricted pulling operations. Sensor data processing persists the original data on handhelds but is mainly aimed at extracting a chosen set of signal features calculated over specific time windows, which vary with the analysed signals and the sensor data acquisition rates. Long-term monitoring of patients also requires mechanisms that probe the patient and, on detecting anomalies or drastic changes in characteristics, tune the data acquisition process.
This paper describes experiences connected with the design of a scalable decision support tool and evaluation techniques for the architectural concepts implemented within the mobile and server software.
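The windowed feature extraction described above can be sketched as follows. The function name and the particular features (mean, RMS, peak-to-peak) are illustrative assumptions; the paper's system selects features and window lengths per signal and per sensor acquisition rate.

```python
import math

def window_features(signal, window, step):
    """Slide a fixed-size window over a sampled signal and compute
    simple per-window features (mean, RMS, peak-to-peak).
    `window` and `step` are in samples; in the described system both
    would vary with the analysed signal and its acquisition rate."""
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        w = signal[start:start + window]
        mean = sum(w) / window
        rms = math.sqrt(sum(x * x for x in w) / window)
        feats.append({"mean": mean, "rms": rms, "p2p": max(w) - min(w)})
    return feats
```

Shipping only these per-window features (rather than raw samples) to the mobile or server tier is one way the architecture variants trade transmission volume against on-device computation.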


2014 ◽  
Vol 513-517 ◽  
pp. 1464-1469 ◽  
Author(s):  
Zhi Kun Chen ◽  
Shu Qiang Yang ◽  
Shuang Tan ◽  
Hui Zhao ◽  
Li He ◽  
...  

With the development of Internet technology and cloud computing, more and more applications are confronted with the challenges of big data. NoSQL databases suit the management of big data because of their high scalability, high availability and high fault tolerance, and they are one of the key technologies for big data management. We improve the performance of massive data processing in NoSQL databases through large-scale parallel data processing and data-local computing, so how to allocate the data becomes a major challenge for NoSQL databases. In this paper we propose a data allocation strategy based on node load, which adjusts the data allocation according to the execution status of the system and keeps the data allocation balanced at a small cost. Finally, we use experiments to verify the effectiveness of the proposed strategy. The experiments show that it improves system performance compared with other allocation strategies.
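A load-based allocation strategy of the kind outlined above can be reduced, in its simplest greedy form, to placing each new partition on the currently least-loaded node. The sketch below assumes a unit load cost per partition; the paper's strategy additionally reacts to the runtime execution status of the system, which is not modeled here.

```python
def assign_partition(node_loads):
    """Greedy load-based placement: pick the least-loaded node for
    the next data partition and update its load. `node_loads` maps
    node name -> current load (a hypothetical unit-cost model)."""
    node = min(node_loads, key=node_loads.get)
    node_loads[node] += 1
    return node
```

Because placement decisions only touch the load table, rebalancing stays cheap, which matches the abstract's claim of keeping allocation balanced at small cost.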


Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

Abstract: This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and tests simultaneously, and thus in many situations improves the prediction and size of the resulting classifiers. However, as a population-based, iterative approach, it can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines the knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of the memory and computing resources of GPUs. The search for the tree structure and tests is performed on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.
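The data-parallel decomposition mentioned above splits the dataset across devices so each computes a partial fitness term that is then reduced. The sketch below shows only that decomposition pattern in plain Python (`partial_errors`, `fitness`, and accuracy-as-fitness are illustrative assumptions; the paper delegates the per-chunk work to CUDA kernels on multiple GPUs).

```python
def partial_errors(tree_predict, chunk):
    """Misclassification count on one data chunk (the part that
    would run on one GPU in the described approach)."""
    return sum(1 for x, y in chunk if tree_predict(x) != y)

def fitness(tree_predict, data, n_workers):
    """Data-parallel fitness evaluation: split the instances into
    n_workers chunks, evaluate partial error counts independently,
    then reduce. Uses accuracy as an illustrative fitness measure."""
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    total_err = sum(partial_errors(tree_predict, c) for c in chunks)
    return 1.0 - total_err / len(data)
```

Because the per-chunk error counts are independent, adding workers (GPUs) shrinks the dominant cost almost proportionally, which is consistent with the near-linear scalability the abstract reports.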


2008 ◽  
Vol 25 (5) ◽  
pp. 287-300 ◽  
Author(s):  
B. Martin ◽  
A. Al‐Shabibi ◽  
S.M. Batraneanu ◽  
Ciobotaru ◽  
G.L. Darlea ◽  
...  
