Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

With the development of high-performance computing and big data applications, the volume of data transmitted, stored, and processed by high-performance computing cluster systems is growing explosively. Compressing large-scale data efficiently, and thereby reducing the space needed for storage and transmission, is one of the keys to improving the performance of these systems. In this paper, we present SW-LZMA, a parallel design and optimization of LZMA for the Sunway 26010 (SW26010) heterogeneous many-core processor. Taking the characteristics of the SW26010 processor into account, we analyse the storage space requirements, memory access characteristics, and hotspot functions of the LZMA algorithm and implement thread-level parallelism for LZMA on top of the Athread interface. Furthermore, we lay out the LDM address space at a fine granularity to implement a DMA double-buffered cyclic sliding-window algorithm, which optimizes the performance of SW-LZMA. The experimental results show that, compared with the serial baseline implementation of LZMA, the parallel LZMA algorithm achieves a maximum speedup of 4.1 times on the Silesia corpus benchmark and 5.3 times on a large-scale data set.
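The idea of thread-level parallel LZMA can be illustrated with a small host-side sketch. This is not the Athread/DMA implementation the abstract describes: it assumes the input is split into independent fixed-size chunks (the abstract does not say how the data is partitioned), and the function names `compress_chunked` and `decompress_chunked` are hypothetical. It relies only on Python's standard `lzma` module.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def compress_chunked(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Compress fixed-size chunks independently and concatenate the streams.

    Assumption: chunks are compressed independently, which trades some
    compression ratio for parallelism.
    """
    # Always produce at least one chunk so empty input still yields a valid stream.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so the streams stay in order.
        streams = list(pool.map(lzma.compress, chunks))
    return b"".join(streams)


def decompress_chunked(blob: bytes) -> bytes:
    """Decompress a concatenation of complete LZMA (.xz) streams."""
    # lzma.decompress accepts several concatenated streams and joins the results.
    return lzma.decompress(blob)
```

Whether the thread pool yields a real speedup depends on the interpreter and workload; a process pool is a possible substitute under the same chunking assumption.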

Work in progress — Integration of the scientific workflow paradigm into high performance computing and large scale data management curricula

2010 IEEE Frontiers in Education Conference (FIE) ◽ 10.1109/fie.2010.5673235 ◽ 2010

Author(s): Brandeis Marshall ◽ John Springer ◽ Thomas Hacker

Keyword(s): Data Management ◽ High Performance Computing ◽ High Performance ◽ Large Scale ◽ Scientific Workflow ◽ Work In Progress ◽ Large Scale Data ◽ Performance Computing ◽ Scale Data

Enabling low latency at large-scale data center and high-performance computing interconnect networks using fine-grained all-optical switching technology

2017 International Conference on Optical Network Design and Modeling (ONDM) ◽ 10.23919/ondm.2017.7958532 ◽ 2017 ◽ Cited By ~ 3

Author(s): Nan Hua ◽ Zhizhen Zhong ◽ Xiaoping Zheng

Keyword(s): Data Center ◽ High Performance ◽ Optical Switching ◽ Large Scale ◽ Fine Grained ◽ Large Scale Data ◽ All Optical ◽ Performance Computing ◽ All Optical Switching ◽ Scale Data

Enabling Large-Scale Biomedical Analysis in the Cloud

BioMed Research International ◽ 10.1155/2013/185679 ◽ 2013 ◽ Vol 2013 ◽ pp. 1-6 ◽ Cited By ~ 10

Author(s): Ying-Chih Lin ◽ Chin-Sheng Yu ◽ Yen-Jen Lin

Keyword(s): High Performance ◽ Large Scale ◽ Computing System ◽ Biomedical Data ◽ Data Intensive Computing ◽ Biomedical Analysis ◽ Data Intensive ◽ Large Scale Data ◽ Performance Computing ◽ Scale Data

Recent progress in high-throughput instrumentation has led to astonishing growth in both the volume and the complexity of biomedical data collected from various sources. Data at this planetary scale poses serious challenges for storage and computing technologies. Cloud computing offers one way to address these challenges because it provides both storage and high-performance computing for large-scale data. This work briefly introduces data-intensive computing systems and summarizes existing cloud-based resources in bioinformatics. These developments and applications should help biomedical research make vast amounts of diverse data meaningful and usable.

Large-scale HPC deployment of Scalable CyberInfrastructure for Artificial Intelligence and Likelihood Free Inference (SCAILFIN)

EPJ Web of Conferences ◽ 10.1051/epjconf/202024509011 ◽ 2020 ◽ Vol 245 ◽ pp. 09011

Author(s): Michael Hildreth ◽ Kenyi Paolo Hurtado Anampa ◽ Cody Kankel ◽ Scott Hampton ◽ Paul Brenner ◽ ...

Keyword(s): Artificial Intelligence ◽ Data Analysis ◽ High Performance Computing ◽ High Throughput ◽ High Performance ◽ Large Scale ◽ Starting Point ◽ Virtual Clusters ◽ Analysis Platform ◽ Performance Computing

The NSF-funded Scalable CyberInfrastructure for Artificial Intelligence and Likelihood Free Inference (SCAILFIN) project aims to develop and deploy artificial intelligence (AI) and likelihood-free inference (LFI) techniques and software using scalable cyberinfrastructure (CI) built on top of existing CI elements. Specifically, the project has extended the CERN-based REANA framework, a cloud-based data analysis platform deployed on top of Kubernetes clusters that was originally designed to enable analysis reusability and reproducibility. REANA is capable of orchestrating extremely complicated multi-step workflows, and it uses Kubernetes clusters both to schedule and distribute container-based workloads across a cluster of available machines and to instantiate and monitor the concrete workloads themselves. This work describes the challenges and development efforts involved in extending REANA and the components that were developed to enable large-scale deployment on High Performance Computing (HPC) resources. Using the Virtual Clusters for Community Computation (VC3) infrastructure as a starting point, we adapted REANA to work with a number of different workload managers, covering both high-performance and high-throughput systems, while removing REANA’s dependence on Kubernetes support at the worker level.

IMSM: An Interval Migration Based Approach for Skew Mitigation in MapReduce

Recent Patents on Computer Science ◽ 10.2174/2213275912666190405141745 ◽ 2019 ◽ Vol 12

Author(s): Balraj Singh ◽ Harsh K Verma

Keyword(s): Load Balance ◽ Completion Time ◽ High Performance ◽ Large Scale ◽ Research Work ◽ Novel Technique ◽ Large Scale Data ◽ Load Imbalance ◽ Performance Computing ◽ Scale Data

Background: The extreme growth of data necessitates high-performance computing. MapReduce is among the most sought-after platforms for processing large-scale data. Research and analysis of the existing system have revealed its performance bottlenecks and areas of concern. In particular, MapReduce suffers severely from skew and load imbalance across its processing nodes. Objective: This paper proposes a novel technique for MapReduce that lowers the skew on Map tasks and improves the load balance. It reduces the execution time of a job by lowering the completion time of its slowest task. Method: The proposed method performs a one-time load-balancing settlement among the Map tasks: it analyzes the expected completion time of each Map task and redistributes the load accordingly. It uses intervals to migrate work from overloaded or slow tasks and appends it to underloaded tasks or free slots. Result: Experiments on different datasets show an improvement of up to 1.3x for the proposed strategy compared with the relevant techniques. Conclusion: A significant performance improvement is observed as a result of the lower job completion time. The proposed technique exhibits reduced skew and a more uniform distribution of load among Map nodes.
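The one-time redistribution step described in the Method can be sketched as follows. This is an illustration only, not the authors' IMSM algorithm: the abstract does not define its intervals, so the sketch assumes an interval is a fixed-size integer unit of work, that each task's expected completion time is proportional to its load, and that work moves greedily from the most loaded task to the least loaded one. The function name `rebalance` is hypothetical.

```python
def rebalance(loads: list[int], interval: int = 1) -> list[int]:
    """Greedy one-time rebalancing of per-task loads (illustrative assumption).

    Repeatedly moves one `interval`-sized unit of work from the most loaded
    task to the least loaded task until the gap between them is at most one
    interval, which lowers the completion time of the slowest task.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    balanced = list(loads)  # Work on a copy; the caller's list is untouched.
    if not balanced:
        return balanced
    while True:
        hi = max(range(len(balanced)), key=balanced.__getitem__)
        lo = min(range(len(balanced)), key=balanced.__getitem__)
        # Stop once another move could no longer narrow the gap.
        if balanced[hi] - balanced[lo] <= interval:
            return balanced
        balanced[hi] -= interval
        balanced[lo] += interval
```

For example, three Map tasks with loads `[10, 2, 3]` end at `[5, 5, 5]`: the total work is unchanged, but the slowest task now finishes after 5 units instead of 10.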
