Data Deduplication
Recently Published Documents


TOTAL DOCUMENTS: 410 (FIVE YEARS: 179)
H-INDEX: 20 (FIVE YEARS: 3)

2022, Vol 2022, pp. 1-10
Author(s): Tingting Yu

To meet user requirements for speed, capacity, storage efficiency, and security, and with the goal of reducing data redundancy and data storage space, an unbalanced big data compatible cloud storage method based on redundancy elimination technology is proposed. A new big data acquisition platform is designed based on Hadoop and NoSQL technologies, and efficient acquisition of unbalanced data is realized through this platform. The collected data are classified and processed by a classifier. The classified unbalanced big data are compressed with the Huffman algorithm, and data security is improved through encryption. Based on the data processing results, redundancy is removed using a data deduplication algorithm, and the designed cloud platform then stores the deduplicated data in the cloud. The results show that the proposed method achieves a high deduplication rate and deduplication speed with low storage space consumption, effectively reducing the data storage burden.
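The pipeline above ends with a deduplication step before the data reach the cloud. As a minimal sketch of that step only — assuming fixed-size chunking, SHA-256 fingerprints, and an in-memory store, none of which the abstract specifies — the following Python shows how only the first copy of each chunk is physically kept:

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed-size chunking; the abstract does not specify a chunking scheme


class DedupStore:
    """Keeps one physical copy per unique chunk; files become lists of fingerprints."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes (single stored copy)
        self.files = {}    # file name -> ordered list of fingerprints (reference pointers)
        self.logical = 0   # total bytes written before deduplication

    def put(self, name, data):
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only on its first occurrence; duplicates reuse the fingerprint.
            self.chunks.setdefault(fp, chunk)
            refs.append(fp)
        self.files[name] = refs
        self.logical += len(data)

    def get(self, name):
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def dedup_ratio(self):
        physical = sum(len(c) for c in self.chunks.values())
        return self.logical / physical if physical else 1.0


store = DedupStore()
store.put("backup-1", b"A" * 8192)
store.put("backup-2", b"A" * 8192)   # entirely duplicate data
print(store.dedup_ratio())           # 4.0: four logical chunks reference one physical chunk
```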


2021, Vol 17 (4), pp. 1-23
Author(s): Datong Zhang, Yuhui Deng, Yi Zhou, Yifeng Zhu, Xiao Qin

Data deduplication techniques construct an index consisting of fingerprint entries to identify and eliminate duplicated copies of repeating data. The bottleneck of disk-based index lookup and the data fragmentation caused by eliminating duplicated chunks are two challenging issues in data deduplication. Deduplication-based backup systems generally employ containers that store contiguous chunks together with their fingerprints to preserve data locality and alleviate the two issues, but this is still inadequate. To address these two issues, we propose a container-utilization-based hot fingerprint entry distilling strategy to improve the performance of deduplication-based backup systems. We divide the index into three parts: hot fingerprint entries, fragmented fingerprint entries, and useless fingerprint entries. A container with utilization smaller than a given threshold is called a sparse container. Fingerprint entries that point to non-sparse containers are hot fingerprint entries. For the remaining fingerprint entries, if a fingerprint entry matches any fingerprint of forthcoming backup chunks, it is classified as a fragmented fingerprint entry. Otherwise, it is classified as a useless fingerprint entry. We observe that hot fingerprint entries account for a small part of the index, whereas the remaining fingerprint entries account for the majority of the index. This intriguing observation inspires us to develop a hot fingerprint entry distilling approach named HID. HID segregates useless fingerprint entries from the index to improve memory utilization and bypass disk accesses. In addition, HID separates fragmented fingerprint entries to make a deduplication-based backup system directly rewrite fragmented chunks, thereby alleviating adverse fragmentation. Moreover, HID introduces a feature to treat fragmented chunks as unique chunks. This feature compensates for the shortcoming that a Bloom filter cannot directly identify certain duplicated chunks (i.e., the fragmented chunks). To take full advantage of the preceding feature, we propose an evolved HID strategy called EHID. EHID incorporates a Bloom filter, to which only hot fingerprints are mapped. In doing so, EHID exhibits two salient features: (i) EHID avoids disk accesses to identify unique chunks and the fragmented chunks; (ii) EHID slashes the false positive rate of the integrated Bloom filter. These salient features push EHID into a high-efficiency mode. Our experimental results show our approach reduces the average memory overhead of the index by 34.11% and 25.13% when using the Linux dataset and the FSL dataset, respectively. Furthermore, compared with the state-of-the-art method HAR, EHID boosts the average backup throughput by up to a factor of 2.25 with the Linux dataset, and EHID reduces the average disk I/O traffic by up to 66.21% on the FSL dataset. EHID also marginally improves the system's restore performance.
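To make the index partitioning concrete, the following is a minimal sketch of how fingerprint entries could be split into hot, fragmented, and useless sets from container utilization and the forthcoming backup stream; the threshold value, data structures, and the `upcoming_fps` set are illustrative assumptions, and the paper's container format and on-disk structures are not reproduced:

```python
SPARSE_THRESHOLD = 0.5  # assumed utilization threshold below which a container is "sparse"


def partition_index(index, container_utilization, upcoming_fps):
    """index: fingerprint -> id of the container it points to
    container_utilization: container id -> utilization in [0, 1]
    upcoming_fps: fingerprints expected in the forthcoming backup stream."""
    hot, fragmented, useless = {}, {}, {}
    for fp, cid in index.items():
        if container_utilization[cid] >= SPARSE_THRESHOLD:
            hot[fp] = cid         # points to a non-sparse container: keep resident in memory
        elif fp in upcoming_fps:
            fragmented[fp] = cid  # duplicate chunk in a sparse container: rewrite it instead
        else:
            useless[fp] = cid     # never matched again: drop from the in-memory index
    return hot, fragmented, useless


# Example: container 1 is well utilized, container 2 is sparse.
index = {"fp_a": 1, "fp_b": 2, "fp_c": 2}
utilization = {1: 0.9, 2: 0.2}
print(partition_index(index, utilization, upcoming_fps={"fp_b"}))
# -> ({'fp_a': 1}, {'fp_b': 2}, {'fp_c': 2}): hot, fragmented, useless
```

Under EHID, only the fingerprints in the hot set would then be mapped into the Bloom filter, which is what lowers its false positive rate.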


Author(s): Pradeep Nayak, Poornachandra S, Pawan J Acharya, Shravya, Shravani

Data deduplication techniques were designed to eliminate duplicate data so that only a single copy of each piece of information is stored. Deduplication decreases the disk space needed to store backups and tracks and removes extra copies of the same data inside a storage unit: only one instance of the data is stored, and subsequent occurrences are given a reference pointer to the original. In a big data storage environment, a huge amount of data must be secured, so proper management, fraud detection, and analysis of data privacy are important topics to consider. This paper examines and evaluates common deduplication techniques, which are presented in plain form. In this review, it was observed that the confidentiality and security of data have been compromised at many levels in common deduplication methods. Although much research is being carried out in various areas of cloud computing, work related to this topic remains limited.
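To illustrate the reference-pointer idea the abstract describes (a single stored instance, with later occurrences pointing to it), here is a toy whole-object sketch with reference counting; the SHA-256 hash and in-memory dictionaries are assumptions for illustration, not structures from the paper:

```python
import hashlib


class SingleInstanceStore:
    """Keeps a single stored copy per distinct object; later writes add a reference pointer."""

    def __init__(self):
        self.objects = {}  # content hash -> stored bytes (the single instance)
        self.refs = {}     # content hash -> reference count
        self.names = {}    # logical name -> content hash (the "reference pointer")

    def write(self, name, data):
        h = hashlib.sha256(data).hexdigest()
        if h not in self.objects:
            self.objects[h] = data  # first occurrence: store the actual data
            self.refs[h] = 0
        self.refs[h] += 1           # later occurrences only bump the reference count
        self.names[name] = h

    def delete(self, name):
        h = self.names.pop(name)
        self.refs[h] -= 1
        if self.refs[h] == 0:       # last reference gone: reclaim the stored copy
            del self.objects[h], self.refs[h]


store = SingleInstanceStore()
store.write("report-v1", b"same bytes")
store.write("report-copy", b"same bytes")  # no new physical copy is stored
print(len(store.objects))                  # 1
```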


2021, pp. 102332
Author(s): Ge Kan, Chunhua Jin, Huihui Zhu, Yongliang Xu, Nian Liu
