Comparative Analysis for Content Defined Chunking Algorithms in Data Deduplication

Webology ◽  
2021 ◽  
Vol 18 (Special Issue 02) ◽  
pp. 255-268
Author(s):  
D. Viji ◽  
Dr. S. Revathy

Data deduplication eliminates redundant data and thereby reduces storage consumption. Nowadays, more and more data is generated and stored in the cloud repeatedly, so a large volume of storage is consumed. Data deduplication tries to reduce data volumes, so that disk space and network bandwidth, and hence the cost and energy consumption of running storage systems, can be reduced. In the data deduplication method, data is broken into small chunks or blocks. A hash ID is calculated for each block and then compared with the existing blocks to detect duplicates. Blocks may be of fixed or variable size; compared with fixed-size blocks, variable-size chunking gives better results. The chunking process is therefore the initial task of deduplication and determines how good the final result is. In this paper, we discuss various content-defined chunking algorithms and compare their performance on chunking properties such as chunking speed, processing time, and throughput.
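
As a rough illustration of the chunking-and-fingerprinting pipeline described above, the following Python sketch splits a byte stream at content-defined cut points found with a simple rolling hash and indexes each chunk by its SHA-256 hash ID. The window, mask, and chunk-size parameters are illustrative assumptions, not values from the paper, and the sketch does not reproduce any specific algorithm compared in it.

```python
import hashlib
import random

# Illustrative parameters (not taken from the paper).
MIN_CHUNK = 2 * 1024      # minimum chunk size in bytes
MAX_CHUNK = 64 * 1024     # maximum chunk size in bytes
MASK = 0x1FFF             # gives an average chunk size of roughly 8 KiB

# Gear-style table mapping each byte value to a pseudo-random 32-bit word.
random.seed(42)
GEAR = [random.getrandbits(32) for _ in range(256)]

def cdc_chunks(data: bytes):
    """Yield content-defined chunks: cut where the rolling hash matches the mask."""
    start, i, h, n = 0, 0, 0, len(data)
    while i < n:
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFF
        i += 1
        length = i - start
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i]
            start, h = i, 0
    if start < n:
        yield data[start:]

def deduplicate(data: bytes):
    """Keep one copy per unique chunk, keyed by its SHA-256 hash ID."""
    store = {}     # hash ID -> chunk bytes (unique chunks only)
    recipe = []    # ordered hash IDs needed to rebuild the original data
    for chunk in cdc_chunks(data):
        hid = hashlib.sha256(chunk).hexdigest()
        store.setdefault(hid, chunk)   # a duplicate chunk is not stored again
        recipe.append(hid)
    return store, recipe

if __name__ == "__main__":
    payload = b"A" * 100_000 + b"one inserted edit" + b"A" * 100_000
    store, recipe = deduplicate(payload)
    print(len(recipe), "chunks,", len(store), "unique")
```

Because the cut points depend on content rather than fixed offsets, an insertion in the middle of the data only disturbs the chunks around the edit, which is why variable-size chunking typically deduplicates better than fixed-size blocking.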

2018 ◽  
Vol 10 (4) ◽  
pp. 43-66 ◽  
Author(s):  
Shubhanshi Singhal ◽  
Pooja Sharma ◽  
Rajesh Kumar Aggarwal ◽  
Vishal Passricha

This article describes how data deduplication efficiently eliminates redundant data by selecting and storing only a single instance of it, and why it is becoming popular in storage systems. Digital data is growing much faster than storage volumes, which underlines the importance of data deduplication for scientists and researchers. Data deduplication is considered the most successful and efficient data reduction technique because it is computationally efficient and offers lossless data reduction. It is applicable to various storage systems, i.e., local storage, distributed storage, and cloud storage. This article discusses the background, components, and key features of data deduplication, helping the reader understand the design issues and challenges in this field.
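
The single-instance principle mentioned above can be illustrated with a small, hypothetical store that keeps one copy per unique content hash and a reference count per copy; this is a generic sketch, not a component described in the article.

```python
import hashlib

class SingleInstanceStore:
    """Toy single-instance store: identical content is kept once and shared
    via a reference count. A generic sketch, not a component of the article."""

    def __init__(self):
        self.blobs = {}   # content hash -> the single stored copy
        self.refs = {}    # content hash -> how many logical files point to it
        self.names = {}   # file name -> content hash

    def put(self, name: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.blobs:          # first copy: actually store it
            self.blobs[digest] = content
        self.refs[digest] = self.refs.get(digest, 0) + 1
        self.names[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blobs.values())

store = SingleInstanceStore()
for fname in ("a.txt", "b.txt", "c.txt"):
    store.put(fname, b"the same attachment saved three times")
print(store.stored_bytes())   # size of one copy, not three
```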


2018 ◽  
Vol 7 (2.8) ◽  
pp. 13
Author(s):  
B Tirapathi Reddy ◽  
M V. P. Chandra Sekhara Rao

Storing data in the cloud has become a necessity as users accumulate abundant data every day and run out of physical storage devices. However, a majority of the data in cloud storage is redundant. Data deduplication using convergent key encryption is the mechanism popularly used to eliminate redundant data items in cloud storage, but convergent key encryption suffers from various drawbacks. For instance, if data items are deduplicated based on a convergent key, an unauthorized user can compromise the cloud storage simply by presenting a guessed hash of a file. Ensuring ownership of the data items is therefore essential to protect them. As the Cuckoo filter offers a minimal false positive rate with minimal space overhead, our mechanism uses it to provide the proof of ownership.
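
The following Python sketch illustrates the two ideas named above: a convergent key derived from the content itself (so identical plaintexts encrypt to identical, deduplicable ciphertexts) and a proof-of-ownership challenge that a guessed hash alone cannot answer. The cipher here is a toy SHA-256 keystream rather than real encryption, and a plain set stands in for the Cuckoo filter used in the paper; all names and parameters are illustrative assumptions.

```python
import hashlib, os

def convergent_key(plaintext: bytes) -> bytes:
    """Convergent key: derived from the content, so identical plaintexts
    always yield the same key and hence the same ciphertext."""
    return hashlib.sha256(plaintext).digest()

def toy_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """Deterministic toy cipher (SHA-256 keystream XOR). Illustration only,
    NOT cryptographically secure; a real system would use a block cipher."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(p ^ k for p, k in zip(plaintext, stream))

class DedupServer:
    """Keeps one ciphertext per content and, before linking a 'duplicate'
    upload to an existing copy, runs a proof-of-ownership challenge so a
    guessed hash alone is not enough. (The paper uses a Cuckoo filter for
    the membership index; a plain set stands in for it here.)"""

    def __init__(self):
        self.index = set()   # fingerprints of stored ciphertexts
        self.blobs = {}      # fingerprint -> ciphertext

    def upload(self, ciphertext: bytes) -> str:
        fp = hashlib.sha256(ciphertext).hexdigest()
        if fp not in self.index:
            self.index.add(fp)
            self.blobs[fp] = ciphertext
        return fp

    def challenge(self, fp: str):
        blob = self.blobs[fp]
        start = int.from_bytes(os.urandom(2), "big") % max(1, len(blob) - 16)
        return start, 16      # "prove you hold bytes [start, start + 16)"

    def verify(self, fp: str, start: int, length: int, answer: bytes) -> bool:
        expected = hashlib.sha256(self.blobs[fp][start:start + length]).digest()
        return answer == expected

# Client side: encrypt under the convergent key, then answer the challenge.
data = b"quarterly-report.pdf contents ..." * 100
ct = toy_encrypt(data, convergent_key(data))

server = DedupServer()
fp = server.upload(ct)
start, length = server.challenge(fp)
proof = hashlib.sha256(ct[start:start + length]).digest()
print(server.verify(fp, start, length, proof))   # True only if the client holds the data
```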


Author(s):  
Sumit Kumar Mahana ◽  
Rajesh Kumar Aggarwal

In the present digital scenario, data is of prime significance for individuals and even more so for organizations. As time passes, the volume of data being produced increases exponentially, which poses a serious concern: the huge amount of redundant data stored on the cloud places an unacceptable load on the cloud storage systems themselves. Therefore, a storage optimization strategy is a fundamental prerequisite for cloud storage systems. Data deduplication is a storage optimization strategy that deletes identical copies of redundant data, optimizes bandwidth, improves utilization of storage space, and hence minimizes storage cost. To guarantee security, the data stored on the cloud must be kept in encrypted form. Consequently, performing deduplication safely over encrypted information in the cloud is a challenging job. This chapter discusses various existing data deduplication techniques, with a focus on securing data on the cloud, that address this challenge.


The enormous growth of digital data, especially data in unstructured formats, poses a tremendous challenge for data analysis as well as for data storage systems, and it steadily increases the cost and degrades the performance of backup systems. Traditional systems do not provide any optimization technique to keep duplicated data from being backed up. Data deduplication has therefore become an essential and economical capacity optimization technique that eliminates redundant data. The following paper reviews the deduplication process, the types of deduplication, and the techniques available for data deduplication. In addition, many approaches proposed by various researchers for deduplication in big data storage systems are studied and compared.


2018 ◽  
Vol 8 (11) ◽  
pp. 2216
Author(s):  
Jiahui Jin ◽  
Qi An ◽  
Wei Zhou ◽  
Jiakai Tang ◽  
Runqun Xiong

Network bandwidth is a scarce resource in big data environments, so data locality is a fundamental problem for data-parallel frameworks such as Hadoop and Spark. This problem is exacerbated in multicore server-based clusters, where multiple tasks running on the same server compete for the server’s network bandwidth. Existing approaches solve this problem by scheduling computational tasks near the input data and considering the server’s free time, data placements, and data transfer costs. However, such approaches usually set identical values for data transfer costs, even though a multicore server’s data transfer cost increases with the number of data-remote tasks; as a result, data-processing time is not minimized effectively. As a solution, we propose DynDL (Dynamic Data Locality), a novel data-locality-aware task-scheduling model that handles dynamic data transfer costs for multicore servers. DynDL offers greater flexibility than existing approaches by using a set of non-decreasing functions to evaluate dynamic data transfer costs. We also propose online and offline algorithms (based on DynDL) that minimize data-processing time and adaptively adjust data locality. Although the scheduling problem behind DynDL is NP-complete (nondeterministic polynomial-complete), we prove that the offline algorithm runs in quadratic time and generates optimal results for DynDL’s specific uses. Using a series of simulations and real-world executions, we show that, in terms of data-processing time, our algorithms are 30% better than algorithms that do not consider dynamic data transfer costs. Moreover, they can adaptively adjust data localities based on the server’s free time, data placement, and network bandwidth, and schedule tens of thousands of tasks within subseconds or seconds.
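
The cost model can be illustrated with a small greedy placement sketch in which each server charges a non-decreasing transfer cost that grows with the number of data-remote tasks already assigned to it. This is a simplified illustration of the idea, not the authors' online or offline algorithm, and all parameters are assumed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Server:
    name: str
    free_at: float = 0.0          # time at which the server becomes free
    remote_tasks: int = 0         # data-remote tasks assigned so far
    # Non-decreasing cost: each extra remote task makes transfers slower.
    transfer_cost: Callable[[int], float] = lambda k: 2.0 + 0.5 * k

@dataclass
class Task:
    name: str
    duration: float
    data_on: str                  # server holding the task's input data

def schedule(tasks: List[Task], servers: Dict[str, Server]):
    """Greedy locality-aware placement: pick the server with the smallest
    finish time, where a remote placement pays the dynamic transfer cost."""
    plan = []
    for task in tasks:
        best, best_finish = None, float("inf")
        for s in servers.values():
            remote = s.name != task.data_on
            cost = s.transfer_cost(s.remote_tasks) if remote else 0.0
            finish = s.free_at + cost + task.duration
            if finish < best_finish:
                best, best_finish = s, finish
        remote = best.name != task.data_on
        best.free_at = best_finish
        best.remote_tasks += int(remote)
        plan.append((task.name, best.name, "remote" if remote else "local", best_finish))
    return plan

servers = {n: Server(n) for n in ("s1", "s2")}
tasks = [Task(f"t{i}", duration=4.0, data_on="s1") for i in range(4)]
for row in schedule(tasks, servers):
    print(row)
```

In the toy run above, the scheduler moves some tasks off the server that holds the data only while the growing remote-transfer cost still pays off, which is the trade-off DynDL models explicitly.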


10.28945/3033 ◽  
2006 ◽  
Author(s):  
G. Adesola Aderounmu ◽  
Bosede Oyatokun ◽  
Matthew Adigun

This paper presents a comparative analysis of the Remote Method Invocation (RMI) and Mobile Agent (MA) paradigms used to implement an information storage and retrieval system in a distributed computing environment. A simulation program was developed in an object-oriented programming language to measure the performance of MA and RMI, using search time, fault tolerance, and invocation cost as the performance parameters in this research work. Experimental results showed that the Mobile Agent paradigm offers superior performance compared to the RMI paradigm: it achieves faster computation and incurs lower invocation cost by making local invocations instead of remote invocations over the network, thereby reducing network bandwidth consumption. Finally, MA has better fault tolerance than RMI; with a probability of failure pr = 0.1, the mobile agent degrades gracefully.
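
A back-of-envelope model makes the invocation-cost argument concrete: RMI pays a network round trip per call, while a mobile agent pays roughly one trip to migrate and one to return results, with all intermediate invocations local. The Python sketch below uses assumed latency, bandwidth, and payload figures, not the paper's simulation parameters.

```python
# Toy cost model comparing RMI with a Mobile Agent (MA).
# All parameters below are illustrative assumptions, not the paper's values.
LATENCY = 0.05          # seconds per network round trip
BANDWIDTH = 1_000_000   # bytes per second
LOCAL_CALL = 0.001      # seconds for a local invocation

def rmi_time(calls: int, request_bytes: int, reply_bytes: int) -> float:
    """Every invocation crosses the network."""
    per_call = LATENCY + (request_bytes + reply_bytes) / BANDWIDTH
    return calls * per_call

def agent_time(calls: int, agent_bytes: int, result_bytes: int) -> float:
    """Migrate once, invoke locally, ship the aggregated results back once."""
    migrate = LATENCY + agent_bytes / BANDWIDTH
    work = calls * LOCAL_CALL
    ship_back = LATENCY + result_bytes / BANDWIDTH
    return migrate + work + ship_back

for n in (10, 100, 1000):
    print(n, "calls:",
          "RMI", round(rmi_time(n, 200, 2_000), 3), "s,",
          "MA", round(agent_time(n, 50_000, 20_000), 3), "s")
```

As the number of invocations grows, the agent's fixed migration overhead is amortized while RMI's per-call network cost keeps accumulating, which matches the qualitative conclusion of the paper.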


2013 ◽  
Vol 2 (3) ◽  
pp. 58-71
Author(s):  
Tudorica Bogdan George

The application presented in the following subsections intends to cover one of the noticeable gaps in the NoSQL domain, namely the relative lack of working administration tools for the new large-scale data storage systems. Following a comparative analysis of the NoSQL solutions on the market, the MongoDB system was chosen as the target application for this step of development, for reasons mainly related to proven performance, flexibility, existing market presence, and ease of use.
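
As a minimal example of the kind of administrative probe such a tool might issue against MongoDB, the following sketch uses the official pymongo driver to read server status and per-database storage statistics; the connection string and the specific metrics shown are assumptions for illustration, not details taken from the article.

```python
# Minimal administrative probe against a MongoDB instance using pymongo.
# The connection string is a placeholder; adjust it for the target deployment.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)

# Server-wide health information: uptime, connection counts, etc.
status = client.admin.command("serverStatus")
print("uptime (s):", status["uptime"])
print("current connections:", status["connections"]["current"])

# Per-database storage statistics, the raw material for an admin dashboard.
for name in client.list_database_names():
    stats = client[name].command("dbStats")
    print(name, "data size (bytes):", stats["dataSize"])
```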

