Sector and Sphere: the design and implementation of a high-performance data cloud

Author(s):  
Yunhong Gu ◽  
Robert L. Grossman

Cloud computing has demonstrated that processing very large datasets over commodity clusters can be done simply, given the right programming model and infrastructure. In this paper, we describe the design and implementation of the Sector storage cloud and the Sphere compute cloud. In contrast to existing storage and compute clouds, Sector can manage data not only within a data centre but also across geographically distributed data centres. Similarly, the Sphere compute cloud supports user-defined functions (UDFs) over data both within and across data centres. As a special case, MapReduce-style programming can be implemented in Sphere by using a Map UDF followed by a Reduce UDF. We describe experimental studies comparing Sector/Sphere and Hadoop using the Terasort benchmark; in these studies, Sector is approximately twice as fast as Hadoop. Sector/Sphere is open source.
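Sector/Sphere itself exposes a C++ API; the following is only a conceptual Python sketch of the scheme described above, in which MapReduce arises from chaining a Map UDF with a Reduce UDF. All names are illustrative, not the actual Sphere interface.

```python
# Conceptual sketch only: Sphere's real API is C++; names are illustrative.
# A Sphere-style UDF processes one data segment and emits (key, value)
# records; MapReduce is recovered by chaining a Map UDF with a Reduce UDF.
from collections import defaultdict

def map_udf(segment):
    """Map UDF: applied independently to each data segment."""
    for line in segment:
        for word in line.split():
            yield word, 1

def reduce_udf(key, values):
    """Reduce UDF: applied to all values grouped under one key."""
    return key, sum(values)

def run_sphere_style(segments):
    """Toy driver standing in for the Sphere runtime: apply the map UDF
    per segment, shuffle by key, then apply the reduce UDF per key."""
    groups = defaultdict(list)
    for seg in segments:
        for k, v in map_udf(seg):
            groups[k].append(v)
    return [reduce_udf(k, vs) for k, vs in groups.items()]

print(run_sphere_style([["a b a"], ["b c"]]))  # [('a', 2), ('b', 2), ('c', 1)]
```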

Author(s):  
Javier Conejero ◽  
Sandra Corella ◽  
Rosa M Badia ◽  
Jesus Labarta

Task-based programming has proven to be a suitable model for high-performance computing (HPC) applications. Different implementations have demonstrated this and have promoted the acceptance of task-based programming in the OpenMP standard. Furthermore, in recent years, Apache Spark has gained wide popularity in business and research environments as a programming model for addressing emerging big data problems. COMP Superscalar (COMPSs) is a task-based environment that tackles distributed computing (including clouds) and is a good task-based alternative for big data applications. This article describes why we consider task-based programming models a good approach for big data applications. It includes a comparison of Spark and COMPSs in terms of architecture, programming model, and performance, focusing on the structural differences between the two frameworks, on their programmability interfaces, and on their efficiency by means of three widely known benchmarking kernels: Wordcount, Kmeans, and Terasort. These kernels exercise the most important functionalities of both programming models under different workflows and conditions. The main results of this comparison are that (1) COMPSs is able to extract the inherent parallelism from the user code with minimal coding effort, whereas Spark requires existing algorithms to be adapted and rewritten by explicitly using its predefined functions, (2) COMPSs outperforms Spark on these kernels, and (3) COMPSs scales better than Spark in most cases. Finally, we discuss the advantages and disadvantages of both frameworks, highlighting the differences that make each unique, thereby helping to choose the right framework for each particular objective.
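The programmability contrast can be made concrete with Wordcount, one of the three kernels used in the comparison. A minimal sketch, assuming PyCOMPSs and PySpark are installed and launched with their respective runtimes; paths and chunking are illustrative:

```python
# PyCOMPSs: sequential-looking code; the runtime extracts task parallelism
# from the @task annotations and the data dependencies between calls.
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

@task(returns=dict)
def count_words(block):
    counts = {}
    for word in block.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

@task(returns=dict)
def merge(a, b):
    for k, v in b.items():
        a[k] = a.get(k, 0) + v
    return a

def wordcount_compss(blocks):
    partial = [count_words(b) for b in blocks]   # tasks run in parallel
    result = partial[0]
    for p in partial[1:]:
        result = merge(result, p)                # reduction as chained tasks
    return compss_wait_on(result)

# Spark: the algorithm must be rephrased in Spark's predefined
# transformations (flatMap/map/reduceByKey).
from operator import add
from pyspark import SparkContext

def wordcount_spark(path):
    sc = SparkContext(appName="wordcount")
    return (sc.textFile(path)
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(add)
              .collect())
```

The COMPSs version reads as ordinary sequential code and the runtime discovers the task graph; the Spark version expresses the same computation through Spark's own operators, which is the rewriting effort the comparison refers to.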


2019 ◽  
Vol 214 ◽  
pp. 07007
Author(s):  
Petr Fedchenkov ◽  
Andrey Shevel ◽  
Sergey Khoruzhnikov ◽  
Oleg Sadov ◽  
Oleg Lazo ◽  
...  

ITMO University (ifmo.ru) is developing a cloud of geographically distributed data centres. Geographically distributed here means data centres (DCs) located in different places, hundreds or thousands of kilometres apart. Using geographically distributed data centres promises a number of advantages for end users, such as the ability to add further DCs and improved service availability through redundancy and geographical distribution. Services such as data transfer, computing, and data storage are provided to users in the form of virtual objects, including virtual machines, virtual storage, and virtual data transfer links.


2020 ◽  
Vol 38 (3-4) ◽  
pp. 1-31
Author(s):  
Won Wook Song ◽  
Youngseok Yang ◽  
Jeongyoon Eo ◽  
Jangho Seo ◽  
Joo Yeon Kim ◽  
...  

Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project.
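Apache Nemo is a Java framework; the toy sketch below (in Python, with hypothetical names) only illustrates the underlying idea of composable optimization passes that rewrite execution annotations on a dataflow intermediate representation while leaving the dataflow itself, and hence application semantics, untouched.

```python
# Toy illustration of a dataflow IR plus a composable optimization pass.
# Hypothetical names; not Apache Nemo's actual (Java) API.
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    parallelism: int = 1          # execution annotation a pass may rewrite

@dataclass
class DAG:
    vertices: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (src, dst, comm_pattern)

def large_shuffle_pass(dag, factor=4):
    """A pass: raise the parallelism of vertices that receive shuffle
    edges. It only rewrites execution annotations, never the dataflow
    itself, so application semantics are preserved."""
    shuffle_dsts = {dst for (_, dst, pat) in dag.edges if pat == "shuffle"}
    for v in dag.vertices:
        if v.name in shuffle_dsts:
            v.parallelism *= factor
    return dag

dag = DAG([Vertex("map"), Vertex("reduce")], [("map", "reduce", "shuffle")])
for p in [large_shuffle_pass]:        # passes compose: apply in sequence
    dag = p(dag)
print([(v.name, v.parallelism) for v in dag.vertices])  # [('map', 1), ('reduce', 4)]
```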


Data Mining ◽  
2011 ◽  
pp. 106-141
Author(s):  
Massimo Coppola ◽  
Marco Vanneschi

We consider the application of parallel programming environments to develop portable and efficient high-performance data mining (DM) tools. We first assess the need for parallel and distributed DM applications by pointing out the scalability problems of some mining techniques and the need to mine large, possibly geographically distributed databases. We discuss the main issues of exploiting parallel and distributed computation for DM algorithms. A high-level programming language enhances the software engineering aspects of parallel DM, and it simplifies the problems of integration with existing sequential and parallel data management systems, thus leading to programming-efficient and high-performance implementations of applications. We describe a programming environment we have implemented that is based on the parallel skeleton model, and we examine the addition of object-like interfaces toward external libraries and system software layers. This kind of abstraction will be included in the forthcoming programming environment ASSIST. In the main part of the chapter, as a proof of concept, we describe three well-known DM algorithms: Apriori, C4.5, and DBSCAN. For each problem, we explain the sequential algorithm and a structured parallel version, which is discussed and compared to parallel solutions found in the literature. We also discuss the potential gain in performance and expressiveness from the addition of external objects, on the basis of the experiments performed so far. We evaluate the approach with respect to performance results, design, and implementation considerations.
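As a rough illustration of the structured (skeleton) approach, the sketch below expresses Apriori's candidate-counting step as a map/farm skeleton over data partitions, approximated here with a Python process pool; the structured pattern, not the particular library, is the point, and all names are invented for the example.

```python
# Illustrative only: a map/farm skeleton applied to Apriori's
# candidate-counting step, approximated with a process pool.
from collections import Counter
from itertools import combinations
from multiprocessing import Pool

def count_pairs(transactions):
    """Worker: count candidate 2-itemsets in one data partition."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return counts

def apriori_pairs(partitions, min_support):
    """Map skeleton: apply the worker to each partition in parallel,
    then reduce the partial counts and filter by support."""
    with Pool() as pool:
        partials = pool.map(count_pairs, partitions)
    total = sum(partials, Counter())
    return {c for c, n in total.items() if n >= min_support}

if __name__ == "__main__":
    data = [[["a", "b", "c"], ["a", "b"]], [["b", "c"], ["a", "b", "c"]]]
    print(apriori_pairs(data, min_support=2))
```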


2018 ◽  
Vol 7 (3.34) ◽  
pp. 141
Author(s):  
D Ramya ◽  
J Deepa ◽  
P N.Karthikayan

A geographically distributed data center ensures globalization of data as well as security for organizations, and it supports the principles of disaster recovery. These aspects create business opportunities for companies that own many sites and for cloud infrastructures with multiple owners. The data centers store critical and confidential documents that multiple organizations share in the cloud infrastructure. Previously, separate servers with different operating systems and software applications were used; because this was difficult to maintain, servers are now consolidated, which allows sharing of resources at low maintenance cost [7]. The availability of documents should be increased and downtime reduced, so workload management becomes challenging among geographically distributed data centers. In this paper, we focus on the different approaches used for workload management in geo-distributed data centers and discuss the algorithms used and the challenges involved in each approach.
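As a generic illustration of the class of algorithms such surveys cover, the sketch below greedily places workloads on the geo-distributed data center with the most spare capacity, breaking ties toward cheaper sites; all capacities, costs, and names are invented for the example.

```python
# Generic illustration of greedy workload placement across geo-distributed
# data centers; capacities, costs, and names are made up for the example.

def place_workloads(datacenters, workloads):
    """Assign each workload to the DC with the most remaining capacity,
    preferring cheaper sites on ties. Returns {workload_id: dc_name}."""
    free = {name: cap for name, (cap, _) in datacenters.items()}
    cost = {name: c for name, (_, c) in datacenters.items()}
    placement = {}
    for wid, demand in sorted(workloads.items(), key=lambda kv: -kv[1]):
        candidates = [d for d in free if free[d] >= demand]
        if not candidates:
            raise RuntimeError(f"no capacity for workload {wid}")
        best = max(candidates, key=lambda d: (free[d], -cost[d]))
        free[best] -= demand
        placement[wid] = best
    return placement

dcs = {"eu-west": (100, 1.0), "us-east": (80, 0.8), "ap-south": (60, 1.2)}
jobs = {"job1": 50, "job2": 40, "job3": 30}
print(place_workloads(dcs, jobs))
# {'job1': 'eu-west', 'job2': 'us-east', 'job3': 'ap-south'}
```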


2010 ◽  
Vol 20 (02) ◽  
pp. 187-208
Author(s):  
PANAGIOTIS E. HADJIDOUKAS ◽  
LAURENT AMSALEG

This paper presents a high-performance parallel implementation of a hierarchical data clustering algorithm. The OpenMP programming model, either enhanced with our lightweight runtime support or used through its tasking model, deals with the high irregularity of the algorithm and allows for efficient exploitation of the inherent loop-level nested parallelism. A thorough experimental evaluation demonstrates the performance scalability of our parallelization and the effective utilization of computational resources, resulting in a clustering approach able to provide high-quality clustering of very large datasets.
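The implementation in the paper uses OpenMP in C/C++; as a rough Python analogue of the task parallelism it exploits, the sketch below runs the independent splits at each level of a divisive (top-down) hierarchical clustering as parallel tasks. The 2-means split and the stopping rule are invented for illustration.

```python
# Rough Python analogue only: the paper's implementation uses OpenMP tasks
# in C/C++. The 2-means split and the stopping rule are invented here.
import random
from concurrent.futures import ThreadPoolExecutor

def split_two_means(points, iters=10):
    """Crude 1-D 2-means split of one cluster (illustrative only)."""
    c0, c1 = min(points), max(points)
    for _ in range(iters):
        left = [p for p in points if abs(p - c0) <= abs(p - c1)]
        right = [p for p in points if abs(p - c0) > abs(p - c1)]
        if left:
            c0 = sum(left) / len(left)
        if right:
            c1 = sum(right) / len(right)
    return left, right

def divisive_clustering(points, min_size=3):
    """Top-down hierarchical clustering: the splits at each level are
    independent of one another, so each runs as a parallel task."""
    clusters, done = [points], []
    with ThreadPoolExecutor() as pool:
        while clusters:
            futures = [pool.submit(split_two_means, c) for c in clusters]
            clusters = []
            for f in futures:
                left, right = f.result()
                if not left or not right:      # degenerate split: stop here
                    done.append(left or right)
                    continue
                for part in (left, right):
                    (done if len(part) <= min_size else clusters).append(part)
    return done

random.seed(0)
data = [random.gauss(mu, 0.5) for mu in (0.0, 5.0, 10.0) for _ in range(6)]
print(divisive_clustering(data))
```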


2020 ◽  
Vol 196 ◽  
pp. 105777
Author(s):  
Jadson Jose Monteiro Oliveira ◽  
Robson Leonardo Ferreira Cordeiro
