Various Approaches Proposed for Eliminating Duplicate Data in a System

Author(s):  
Roman Čerešňák ◽  
Karol Matiaško ◽  
Adam Dudáš

The growth of the big data processing market has led to increased load on computation data centers and to changes in the methods used for storing data, in the communication between computing units, and in the computational time needed to process or edit data. Methods of distributed and parallel data processing have introduced new problems related to computation over data that need to be examined. Unlike conventional cloud services, big data services are characterized by a tight coupling between the data and the computations: computational tasks can be carried out only if the relevant data are available. Three factors that influence the speed and efficiency of data processing are data duplicity, data integrity, and data security. We are motivated to study the problems related to the growing time needed for data processing by optimizing these three factors in geographically distributed data centers.
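
As an illustration of the data-duplicity factor mentioned in the abstract, the following minimal sketch shows content-hash-based deduplication, one common approach to eliminating duplicate data. The chunk granularity and the `DedupStore` name are illustrative assumptions, not part of the cited work.

```scala
import java.security.MessageDigest

// Minimal content-hash deduplication sketch (illustrative, not the authors' method).
// Identical payloads are stored once and referenced by their SHA-256 digest.
object DedupSketch {

  private def sha256Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256").digest(bytes).map("%02x".format(_)).mkString

  // Maps digest -> payload; duplicates are detected by digest equality.
  final class DedupStore {
    private val chunks = scala.collection.mutable.Map.empty[String, Array[Byte]]

    // Returns the digest used as a reference; stores the payload only if unseen.
    def put(payload: Array[Byte]): String = {
      val key = sha256Hex(payload)
      chunks.getOrElseUpdate(key, payload)
      key
    }

    def uniqueChunks: Int = chunks.size
  }

  def main(args: Array[String]): Unit = {
    val store = new DedupStore
    val data  = Seq("block-A", "block-B", "block-A").map(_.getBytes("UTF-8"))
    val refs  = data.map(store.put)
    println(s"stored ${store.uniqueChunks} unique chunks for ${refs.size} writes")
  }
}
```

In a geographically distributed setting, the same digest comparison can also be done before transferring a chunk between data centers, so that only unseen data crosses the wide-area link.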

2020 ◽  
Vol 33 (12) ◽  
pp. e4453 ◽  
Author(s):  
Iftikhar Ahmad ◽  
Muhammad Imran Khan Khalil ◽  
Syed Adeel Ali Shah

2018 ◽  
Vol 60 (5-6) ◽  
pp. 321-326 ◽  
Author(s):  
Christoph Boden ◽  
Tilmann Rabl ◽  
Volker Markl

Abstract The last decade has been characterized by the collection and availability of unprecedented amounts of data, due to rapidly decreasing storage costs and the omnipresence of sensors and data-producing global online services. In order to process and analyze this data deluge, novel distributed data processing systems resting on the paradigm of data flow, such as Apache Hadoop, Apache Spark, and Apache Flink, were built and have been scaled to tens of thousands of machines. However, writing efficient implementations of data analysis programs on these systems requires a deep understanding of systems programming, preventing large groups of data scientists and analysts from using this technology efficiently. In this article, we present some of the main achievements of the research carried out by the Berlin Big Data Center (BBDC). We introduce the two domain-specific languages Emma and LARA, which are deeply embedded in Scala and enable declarative specification and automatic parallelization of data analysis programs; the PEEL framework for transparent and reproducible benchmark experiments on distributed data processing systems; and approaches to foster the interpretability of machine learning models. Finally, we provide an overview of the challenges to be addressed in the second phase of the BBDC.
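
To illustrate the kind of hand-written dataflow code that declarative layers such as Emma and LARA aim to abstract away, here is a plain Apache Spark word count in Scala. It is a generic baseline example, not the Emma or LARA API described by the authors.

```scala
import org.apache.spark.sql.SparkSession

// Generic Spark (Scala) word count -- hand-written distributed dataflow code,
// shown as a baseline; not the Emma/LARA DSLs described in the abstract.
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .master("local[*]")          // local mode for the sketch; a cluster URL in practice
      .getOrCreate()

    val lines  = spark.sparkContext.textFile(args.headOption.getOrElse("input.txt"))
    val counts = lines
      .flatMap(_.split("\\s+"))    // tokenize each line
      .map(word => (word, 1))      // pair each word with a count of one
      .reduceByKey(_ + _)          // aggregate counts per word across partitions

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

Even for this small task, the programmer must choose operators, key the data explicitly, and reason about partitioning, which is the gap a declaratively specified, automatically parallelized program is meant to close.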


2019 ◽  
Vol 214 ◽  
pp. 07007
Author(s):  
Petr Fedchenkov ◽  
Andrey Shevel ◽  
Sergey Khoruzhnikov ◽  
Oleg Sadov ◽  
Oleg Lazo ◽  
...  

ITMO University (ifmo.ru) is developing a cloud of geographically distributed data centres, i.e. data centres (DC) located in different places, hundreds or thousands of kilometres apart. Using geographically distributed data centres promises a number of advantages for end users, such as the opportunity to add additional DCs and improved service availability through redundancy and geographical distribution. Services such as data transfer, computing, and data storage are provided to users in the form of virtual objects, including virtual machines, virtual storage, and virtual data transfer links.
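
As a minimal sketch of the "virtual objects" mentioned above, the following Scala types model virtual machines, virtual storage, and virtual data transfer links as resources bound to data centres. The type and field names are illustrative assumptions, not the ITMO cloud's actual interface.

```scala
// Illustrative model of virtual objects in a geographically distributed cloud
// (hypothetical names, not ITMO's actual API).
final case class DataCentre(id: String, location: String)

sealed trait VirtualObject { def dc: DataCentre }
final case class VirtualMachine(dc: DataCentre, vcpus: Int, memoryGb: Int) extends VirtualObject
final case class VirtualStorage(dc: DataCentre, capacityGb: Int) extends VirtualObject
final case class VirtualLink(dc: DataCentre, peer: DataCentre, bandwidthGbps: Int) extends VirtualObject

object CloudSketch {
  def main(args: Array[String]): Unit = {
    val dc1 = DataCentre("dc-1", "site A")
    val dc2 = DataCentre("dc-2", "site B")

    // A user's allocation spanning two distant data centres.
    val allocation: Seq[VirtualObject] = Seq(
      VirtualMachine(dc1, vcpus = 4, memoryGb = 16),
      VirtualStorage(dc2, capacityGb = 512),
      VirtualLink(dc1, dc2, bandwidthGbps = 10)
    )
    allocation.foreach(println)
  }
}
```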


2016 ◽  
Vol 24 (11) ◽  
pp. 12310 ◽  
Author(s):  
Payman Samadi ◽  
Ke Wen ◽  
Junjie Xu ◽  
Keren Bergman
