Sector and Sphere: the design and implementation of a high-performance data cloud

Author(s):  
Yunhong Gu ◽  
Robert L. Grossman

Cloud computing has demonstrated that processing very large datasets over commodity clusters can be done simply, given the right programming model and infrastructure. In this paper, we describe the design and implementation of the Sector storage cloud and the Sphere compute cloud. In contrast to existing storage and compute clouds, Sector can manage data not only within a data centre but also across geographically distributed data centres. Similarly, the Sphere compute cloud supports user-defined functions (UDFs) over data both within and across data centres. As a special case, MapReduce-style programming can be implemented in Sphere by using a Map UDF followed by a Reduce UDF. We describe experimental studies comparing Sector/Sphere and Hadoop using the Terasort benchmark; in these studies, Sector is approximately twice as fast as Hadoop. Sector/Sphere is open source.
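Sector/Sphere itself exposes a C++ API; the following is only a conceptual Python sketch of the scheme described above, in which MapReduce arises from chaining a Map UDF with a Reduce UDF. All names are illustrative, not the actual Sphere interface.

```python
# Conceptual sketch only: Sphere's real API is C++; names are illustrative.
# A Sphere-style UDF processes one data segment and emits (key, value)
# records; MapReduce is recovered by chaining a Map UDF with a Reduce UDF.
from collections import defaultdict

def map_udf(segment):
    """Map UDF: applied independently to each data segment."""
    for line in segment:
        for word in line.split():
            yield word, 1

def reduce_udf(key, values):
    """Reduce UDF: applied to all values grouped under one key."""
    return key, sum(values)

def run_sphere_style(segments):
    """Toy driver standing in for the Sphere runtime: apply the map UDF
    per segment, shuffle by key, then apply the reduce UDF per key."""
    groups = defaultdict(list)
    for seg in segments:
        for k, v in map_udf(seg):
            groups[k].append(v)
    return [reduce_udf(k, vs) for k, vs in groups.items()]

print(run_sphere_style([["a b a"], ["b c"]]))  # [('a', 2), ('b', 2), ('c', 1)]
```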

Author(s):  
Javier Conejero ◽  
Sandra Corella ◽  
Rosa M Badia ◽  
Jesus Labarta

Task-based programming has proven to be a suitable model for high-performance computing (HPC) applications. Different implementations have demonstrated this and have promoted the acceptance of task-based programming in the OpenMP standard. Furthermore, in recent years, Apache Spark has gained wide popularity in business and research environments as a programming model for addressing emerging big data problems. COMP Superscalar (COMPSs) is a task-based environment that tackles distributed computing (including clouds) and is a good task-based alternative for big data applications. This article describes why we consider task-based programming models a good approach for big data applications. It includes a comparison of Spark and COMPSs in terms of architecture, programming model, and performance, focusing on the structural differences between the two frameworks, on their programmability interfaces, and on their efficiency by means of three widely known benchmarking kernels: Wordcount, Kmeans, and Terasort. These kernels exercise the most important functionalities of both programming models under different workflows and conditions. The main results of this comparison are that (1) COMPSs is able to extract the inherent parallelism from the user code with minimal coding effort, whereas Spark requires existing algorithms to be adapted and rewritten by explicitly using its predefined functions, (2) COMPSs outperforms Spark on these kernels, and (3) COMPSs scales better than Spark in most cases. Finally, we discuss the advantages and disadvantages of both frameworks, highlighting the differences that make each unique, thereby helping to choose the right framework for each particular objective.
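The programmability contrast can be made concrete with Wordcount, one of the three kernels used in the comparison. A minimal sketch, assuming PyCOMPSs and PySpark are installed and launched with their respective runtimes; paths and chunking are illustrative:

```python
# PyCOMPSs: sequential-looking code; the runtime extracts task parallelism
# from the @task annotations and the data dependencies between calls.
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

@task(returns=dict)
def count_words(block):
    counts = {}
    for word in block.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

@task(returns=dict)
def merge(a, b):
    for k, v in b.items():
        a[k] = a.get(k, 0) + v
    return a

def wordcount_compss(blocks):
    partial = [count_words(b) for b in blocks]   # tasks run in parallel
    result = partial[0]
    for p in partial[1:]:
        result = merge(result, p)                # reduction as chained tasks
    return compss_wait_on(result)

# Spark: the algorithm must be rephrased in Spark's predefined
# transformations (flatMap/map/reduceByKey).
from operator import add
from pyspark import SparkContext

def wordcount_spark(path):
    sc = SparkContext(appName="wordcount")
    return (sc.textFile(path)
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(add)
              .collect())
```

The COMPSs version reads as ordinary sequential code and the runtime discovers the task graph; the Spark version expresses the same computation through Spark's own operators, which is the rewriting effort the comparison refers to.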


2019 ◽  
Vol 214 ◽  
pp. 07007
Author(s):  
Petr Fedchenkov ◽  
Andrey Shevel ◽  
Sergey Khoruzhnikov ◽  
Oleg Sadov ◽  
Oleg Lazo ◽  
...  

ITMO University (ifmo.ru) is developing a cloud of geographically distributed data centres. Geographically distributed here means data centres (DCs) located in different places, hundreds or thousands of kilometres apart. Using geographically distributed data centres promises a number of advantages for end users, such as the ability to add further DCs and improved service availability through redundancy and geographical distribution. Services such as data transfer, computing, and data storage are provided to users in the form of virtual objects, including virtual machines, virtual storage, and virtual data transfer links.


2020 ◽  
Vol 38 (3-4) ◽  
pp. 1-31
Author(s):  
Won Wook Song ◽  
Youngseok Yang ◽  
Jeongyoon Eo ◽  
Jangho Seo ◽  
Joo Yeon Kim ◽  
...  

Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project.
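Apache Nemo is a Java framework; the toy sketch below (in Python, with hypothetical names) only illustrates the underlying idea of composable optimization passes that rewrite execution annotations on a dataflow intermediate representation while leaving the dataflow itself, and hence application semantics, untouched.

```python
# Toy illustration of a dataflow IR plus a composable optimization pass.
# Hypothetical names; not Apache Nemo's actual (Java) API.
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    parallelism: int = 1          # execution annotation a pass may rewrite

@dataclass
class DAG:
    vertices: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (src, dst, comm_pattern)

def large_shuffle_pass(dag, factor=4):
    """A pass: raise the parallelism of vertices that receive shuffle
    edges. It only rewrites execution annotations, never the dataflow
    itself, so application semantics are preserved."""
    shuffle_dsts = {dst for (_, dst, pat) in dag.edges if pat == "shuffle"}
    for v in dag.vertices:
        if v.name in shuffle_dsts:
            v.parallelism *= factor
    return dag

dag = DAG([Vertex("map"), Vertex("reduce")], [("map", "reduce", "shuffle")])
for p in [large_shuffle_pass]:        # passes compose: apply in sequence
    dag = p(dag)
print([(v.name, v.parallelism) for v in dag.vertices])  # [('map', 1), ('reduce', 4)]
```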


Data Mining ◽  
2011 ◽  
pp. 106-141
Author(s):  
Massimo Coppola ◽  
Marco Vanneschi

We consider the application of parallel programming environments to develop portable and efficient high-performance data mining (DM) tools. We first assess the need for parallel and distributed DM applications by pointing out the scalability problems of some mining techniques and the need to mine large, possibly geographically distributed databases. We discuss the main issues of exploiting parallel and distributed computation for DM algorithms. A high-level programming language enhances the software engineering aspects of parallel DM, and it simplifies the problems of integration with existing sequential and parallel data management systems, thus leading to programming-efficient and high-performance implementations of applications. We describe a programming environment we have implemented that is based on the parallel skeleton model, and we examine the addition of object-like interfaces toward external libraries and system software layers. This kind of abstraction will be included in the forthcoming programming environment ASSIST. In the main part of the chapter, as a proof of concept, we describe three well-known DM algorithms: Apriori, C4.5, and DBSCAN. For each problem, we explain the sequential algorithm and a structured parallel version, which is discussed and compared to parallel solutions found in the literature. We also discuss the potential gain in performance and expressiveness from the addition of external objects, on the basis of the experiments performed so far. We evaluate the approach with respect to performance results, design, and implementation considerations.
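As a rough illustration of the structured (skeleton) approach, the sketch below expresses Apriori's candidate-counting step as a map/farm skeleton over data partitions, approximated here with a Python process pool; the structured pattern, not the particular library, is the point, and all names are invented for the example.

```python
# Illustrative only: a map/farm skeleton applied to Apriori's
# candidate-counting step, approximated with a process pool.
from collections import Counter
from itertools import combinations
from multiprocessing import Pool

def count_pairs(transactions):
    """Worker: count candidate 2-itemsets in one data partition."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return counts

def apriori_pairs(partitions, min_support):
    """Map skeleton: apply the worker to each partition in parallel,
    then reduce the partial counts and filter by support."""
    with Pool() as pool:
        partials = pool.map(count_pairs, partitions)
    total = sum(partials, Counter())
    return {c for c, n in total.items() if n >= min_support}

if __name__ == "__main__":
    data = [[["a", "b", "c"], ["a", "b"]], [["b", "c"], ["a", "b", "c"]]]
    print(apriori_pairs(data, min_support=2))
```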


2018 ◽  
Vol 7 (3.34) ◽  
pp. 141
Author(s):  
D Ramya ◽  
J Deepa ◽  
P N.Karthikayan

A geographically distributed data center ensures globalization of data as well as security for organizations, and it supports the principles of disaster recovery. These aspects create business opportunities for companies that own many sites and for cloud infrastructures with multiple owners. The data centers store critical and confidential documents that multiple organizations share in the cloud infrastructure. Previously, separate servers with different operating systems and software applications were used; because this was difficult to maintain, servers are now consolidated, which allows sharing of resources at low maintenance cost [7]. The availability of documents should be increased and downtime reduced, so workload management becomes challenging among geographically distributed data centers. In this paper, we focus on the different approaches used for workload management in geo-distributed data centers and discuss the algorithms used and the challenges involved in each approach.
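As a generic illustration of the class of algorithms such surveys cover, the sketch below greedily places workloads on the geo-distributed data center with the most spare capacity, breaking ties toward cheaper sites; all capacities, costs, and names are invented for the example.

```python
# Generic illustration of greedy workload placement across geo-distributed
# data centers; capacities, costs, and names are made up for the example.

def place_workloads(datacenters, workloads):
    """Assign each workload to the DC with the most remaining capacity,
    preferring cheaper sites on ties. Returns {workload_id: dc_name}."""
    free = {name: cap for name, (cap, _) in datacenters.items()}
    cost = {name: c for name, (_, c) in datacenters.items()}
    placement = {}
    for wid, demand in sorted(workloads.items(), key=lambda kv: -kv[1]):
        candidates = [d for d in free if free[d] >= demand]
        if not candidates:
            raise RuntimeError(f"no capacity for workload {wid}")
        best = max(candidates, key=lambda d: (free[d], -cost[d]))
        free[best] -= demand
        placement[wid] = best
    return placement

dcs = {"eu-west": (100, 1.0), "us-east": (80, 0.8), "ap-south": (60, 1.2)}
jobs = {"job1": 50, "job2": 40, "job3": 30}
print(place_workloads(dcs, jobs))
# {'job1': 'eu-west', 'job2': 'us-east', 'job3': 'ap-south'}
```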


2010 ◽  
Vol 20 (02) ◽  
pp. 187-208
Author(s):  
PANAGIOTIS E. HADJIDOUKAS ◽  
LAURENT AMSALEG

This paper presents a high-performance parallel implementation of a hierarchical data clustering algorithm. The OpenMP programming model, either enhanced with our lightweight runtime support or used through its tasking model, deals with the high irregularity of the algorithm and allows for efficient exploitation of the inherent loop-level nested parallelism. A thorough experimental evaluation demonstrates the performance scalability of our parallelization and the effective utilization of computational resources, resulting in a clustering approach able to provide high-quality clustering of very large datasets.
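The implementation in the paper uses OpenMP in C/C++; as a rough Python analogue of the task parallelism it exploits, the sketch below runs the independent splits at each level of a divisive (top-down) hierarchical clustering as parallel tasks. The 2-means split and the stopping rule are invented for illustration.

```python
# Rough Python analogue only: the paper's implementation uses OpenMP tasks
# in C/C++. The 2-means split and the stopping rule are invented here.
import random
from concurrent.futures import ThreadPoolExecutor

def split_two_means(points, iters=10):
    """Crude 1-D 2-means split of one cluster (illustrative only)."""
    c0, c1 = min(points), max(points)
    for _ in range(iters):
        left = [p for p in points if abs(p - c0) <= abs(p - c1)]
        right = [p for p in points if abs(p - c0) > abs(p - c1)]
        if left:
            c0 = sum(left) / len(left)
        if right:
            c1 = sum(right) / len(right)
    return left, right

def divisive_clustering(points, min_size=3):
    """Top-down hierarchical clustering: the splits at each level are
    independent of one another, so each runs as a parallel task."""
    clusters, done = [points], []
    with ThreadPoolExecutor() as pool:
        while clusters:
            futures = [pool.submit(split_two_means, c) for c in clusters]
            clusters = []
            for f in futures:
                left, right = f.result()
                if not left or not right:      # degenerate split: stop here
                    done.append(left or right)
                    continue
                for part in (left, right):
                    (done if len(part) <= min_size else clusters).append(part)
    return done

random.seed(0)
data = [random.gauss(mu, 0.5) for mu in (0.0, 5.0, 10.0) for _ in range(6)]
print(divisive_clustering(data))
```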


2020 ◽  
Vol 196 ◽  
pp. 105777
Author(s):  
Jadson Jose Monteiro Oliveira ◽  
Robson Leonardo Ferreira Cordeiro
