Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling

2021 ◽  
Vol 17 (4) ◽  
pp. 1-23
Author(s):  
Datong Zhang ◽  
Yuhui Deng ◽  
Yi Zhou ◽  
Yifeng Zhu ◽  
Xiao Qin

Data deduplication techniques construct an index consisting of fingerprint entries to identify and eliminate duplicated copies of repeating data. The bottleneck of disk-based index lookup and the data fragmentation caused by eliminating duplicated chunks are two challenging issues in data deduplication. Deduplication-based backup systems generally employ containers, which store contiguous chunks together with their fingerprints, to preserve data locality and alleviate these two issues; this, however, remains inadequate. To address these two issues, we propose a container-utilization-based hot fingerprint entry distilling strategy to improve the performance of deduplication-based backup systems. We divide the index into three parts: hot fingerprint entries, fragmented fingerprint entries, and useless fingerprint entries. A container whose utilization is smaller than a given threshold is called a sparse container. Fingerprint entries that point to non-sparse containers are hot fingerprint entries. Among the remaining fingerprint entries, an entry that matches a fingerprint of a forthcoming backup chunk is classified as a fragmented fingerprint entry; otherwise, it is classified as a useless fingerprint entry. We observe that hot fingerprint entries account for a small part of the index, whereas the remaining fingerprint entries account for the majority of it. This intriguing observation inspires us to develop a hot fingerprint entry distilling approach named HID. HID segregates useless fingerprint entries from the index to improve memory utilization and bypass disk accesses. In addition, HID separates fragmented fingerprint entries so that a deduplication-based backup system directly rewrites fragmented chunks, thereby alleviating adverse fragmentation. Moreover, HID introduces a feature that treats fragmented chunks as unique chunks.
This feature compensates for the shortcoming that a Bloom filter cannot directly identify certain duplicated chunks (i.e., the fragmented chunks). To take full advantage of this feature, we propose an evolved HID strategy called EHID. EHID incorporates a Bloom filter to which only hot fingerprints are mapped. In doing so, EHID exhibits two salient features: (i) EHID avoids disk accesses when identifying unique chunks and fragmented chunks; (ii) EHID slashes the false positive rate of the integrated Bloom filter. These salient features push EHID into the high-efficiency mode. Our experimental results show that our approach reduces the average memory overhead of the index by 34.11% and 25.13% when using the Linux dataset and the FSL dataset, respectively. Furthermore, compared with the state-of-the-art method HAR, EHID boosts the average backup throughput by up to a factor of 2.25 on the Linux dataset, and EHID reduces the average disk I/O traffic by up to 66.21% on the FSL dataset. EHID also marginally improves the system's restore performance.
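The three-way split of the index can be illustrated in a few lines. This is a sketch under assumed inputs (a fingerprint-to-container map, per-container utilization values, and a set of forthcoming fingerprints), not the authors' implementation; the threshold value is arbitrary:

```python
# Illustrative sketch: classify fingerprint entries as hot, fragmented,
# or useless, following the paper's three-way index split.

THRESHOLD = 0.5  # containers below this utilization are "sparse" (assumed value)

def classify_entries(index, utilization, forthcoming):
    """index: fingerprint -> container id; utilization: container id -> float
    in [0, 1]; forthcoming: set of fingerprints of upcoming backup chunks."""
    hot, fragmented, useless = {}, {}, {}
    for fp, cid in index.items():
        if utilization[cid] >= THRESHOLD:
            hot[fp] = cid            # points to a non-sparse container
        elif fp in forthcoming:
            fragmented[fp] = cid     # sparse container, but still referenced
        else:
            useless[fp] = cid        # sparse container, never referenced again
    return hot, fragmented, useless
```

Only the hot part would then be kept in the fast path (and, in EHID, mapped into the Bloom filter), while useless entries are dropped and fragmented entries trigger rewrites.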

2015 ◽  
Vol 2015 ◽  
pp. 1-12
Author(s):  
Siyu Lin ◽  
Hao Wu

Cyber-physical systems (CPSs) connect with the physical world via communication networks, which significantly increases their security risks. To secure sensitive data, secure forwarding is an essential component of CPSs. However, CPSs have high-dimensional, multiattribute, and multilevel security requirements due to their significantly increased scale and diversity, which imposes high demands on the query and storage of secure forwarding information. To tackle these challenges, we propose a practical secure data forwarding scheme for CPSs. Considering the limited storage capability and computational power of entities, we adopt a Bloom filter to store the secure forwarding information for each entity, which achieves a good balance between storage consumption and query delay. Furthermore, a novel link-based Bloom filter construction method is designed to reduce the false positive rate during Bloom filter construction. Finally, the effects of the false positive rate on the performance of Bloom filter-based secure forwarding under different routing policies are discussed.
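The general idea of storing forwarding information in a Bloom filter can be sketched as follows. This is a minimal illustration of a standard Bloom filter, not the paper's link-based construction; the key format, sizes, and hash choice are assumptions:

```python
# Minimal Bloom-filter sketch for secure-forwarding lookups (illustrative).
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):
        # "No" is definitive; "yes" may be a false positive.
        return all(self.bits[p] for p in self._positions(item))

# An entity could insert, e.g., "next_hop|security_level" strings:
bf = BloomFilter()
bf.add("nodeB|level3")
```

A membership query then replaces a lookup in a full forwarding table, trading a small false positive rate for constant-size storage.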


2010 ◽  
Vol 110 (21) ◽  
pp. 944-949 ◽  
Author(s):  
Ken Christensen ◽  
Allen Roginsky ◽  
Miguel Jimeno

2019 ◽  
Vol 35 (23) ◽  
pp. 4871-4878
Author(s):  
Peng Jiang ◽  
Jie Luo ◽  
Yiqi Wang ◽  
Pingji Deng ◽  
Bertil Schmidt ◽  
...  

Abstract Motivation K-mers along with their frequency have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters is itself large; very often, it is too large to fit into main memory, greatly narrowing its usability. Results We introduce a novel idea of encoding k-mers as well as their frequency, achieving good memory saving and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure to encode counted k-mers by coupled-bit arrays—one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that the average memory-saving ratio on all 31-mers is as high as 13.81 as compared with raw input, with 7 hash functions. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. Availability and implementation The source code of our algorithm is available at github.com/lzhLab/kmcEx. Supplementary information Supplementary data are available at Bioinformatics online.
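A toy version of the coupled-bit-arrays idea can be written down directly: one array records k-mer membership, and a parallel bank of arrays encodes each bit of the k-mer's count at the same hash positions. This is a simplified sketch in the spirit of the description above, not the published kmcEx data structure; all sizes are assumptions:

```python
# Toy coupled-bit-arrays sketch: membership array + per-bit frequency arrays.
import hashlib

M, K, FREQ_BITS = 4096, 7, 8  # array size, hash count, count bits (assumed)

member = bytearray(M)
freq = [bytearray(M) for _ in range(FREQ_BITS)]

def positions(kmer):
    # K positions per k-mer, derived from a salted hash.
    for i in range(K):
        h = hashlib.md5(f"{i}:{kmer}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M

def insert(kmer, count):
    for p in positions(kmer):
        member[p] = 1
        for b in range(FREQ_BITS):
            if (count >> b) & 1:
                freq[b][p] = 1    # set this count bit at the same position

def lookup(kmer):
    pos = list(positions(kmer))
    if not all(member[p] for p in pos):
        return None               # definitely absent
    # A count bit reads 1 only if it is set at all K positions.
    return sum(1 << b for b in range(FREQ_BITS)
               if all(freq[b][p] for p in pos))
```

As with a Bloom filter, collisions can only inflate a reported count, never report an inserted k-mer as absent.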


2018 ◽  
Vol 7 (2.8) ◽  
pp. 13
Author(s):  
B Tirapathi Reddy ◽  
M V. P. Chandra Sekhara Rao

Storing data in the cloud has become a necessity as users accumulate abundant data every day and run out of physical storage devices. However, the majority of the data in cloud storage is redundant. Data deduplication using convergent key encryption has been the mechanism popularly used to eliminate redundant data items in cloud storage. Convergent key encryption suffers from various drawbacks. For instance, if data items are deduplicated based on a convergent key, any unauthorized user can compromise the cloud storage simply by possessing a guessed hash of the file. So, ensuring the ownership of data items is essential to protect them. As the cuckoo filter offers a minimal false positive rate with minimal space overhead, our mechanism provides proof of ownership based on it.
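For reference, the cuckoo filter underlying such a scheme works by storing short fingerprints in one of two candidate buckets linked by an XOR of the fingerprint's hash ("partial-key cuckoo hashing"). The sketch below shows only the filter itself, with assumed sizes; a real proof-of-ownership protocol would layer challenges on top of it:

```python
# Minimal cuckoo-filter sketch (partial-key cuckoo hashing), illustrative only.
import hashlib
import random

BUCKETS, SLOTS, MAX_KICKS = 256, 4, 64   # power-of-two bucket count (assumed)
table = [[] for _ in range(BUCKETS)]

def _h(data):
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def fingerprint(item):
    return (_h(b"fp" + item) % 255) + 1   # 1..255; 0 reserved for "empty"

def insert(item):
    f = fingerprint(item)
    i1 = _h(item) % BUCKETS
    i2 = (i1 ^ _h(bytes([f]))) % BUCKETS  # partner bucket via the fingerprint
    for i in (i1, i2):
        if len(table[i]) < SLOTS:
            table[i].append(f)
            return True
    i = random.choice((i1, i2))
    for _ in range(MAX_KICKS):            # evict entries until a slot frees up
        j = random.randrange(len(table[i]))
        f, table[i][j] = table[i][j], f
        i = (i ^ _h(bytes([f]))) % BUCKETS
        if len(table[i]) < SLOTS:
            table[i].append(f)
            return True
    return False                          # filter considered full

def contains(item):
    f = fingerprint(item)
    i1 = _h(item) % BUCKETS
    i2 = (i1 ^ _h(bytes([f]))) % BUCKETS
    return f in table[i1] or f in table[i2]
```

Because BUCKETS is a power of two, the XOR relation is an involution, so an evicted fingerprint can always find its partner bucket without knowing the original item.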


2016 ◽  
Vol 2016 ◽  
pp. 1-7 ◽  
Author(s):  
Hazalila Kamaludin ◽  
Hairulnizam Mahdin ◽  
Jemal H. Abawajy

Radio Frequency Identification (RFID)-enabled systems are evolving in many applications, such as supply chain management, that need to know the physical location of objects. Naturally, RFID systems create large volumes of duplicate data. As the duplicate data wastes communication, processing, and storage resources and delays decision-making, filtering duplicate data from RFID data streams is an important and challenging problem. Existing Bloom Filter-based approaches for filtering duplicate RFID data streams are complex and slow, as they use multiple hash functions. In this paper, we propose an approach for filtering duplicate data from RFID data streams. The proposed approach is based on a modified Bloom Filter and uses only a single hash function. We performed an extensive empirical study of the proposed approach and compared it against the Bloom Filter, d-Left Time Bloom Filter, and Count Bloom Filter approaches. The results show that the proposed approach outperforms the baseline approaches in terms of false positive rate, execution time, and true positive rate.
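A single-hash time-window filter of this general flavor is easy to sketch: one hash maps each tag ID to a cell holding its last-seen time, and a read arriving within a window of that time is treated as a duplicate. The details below (cell count, window length, collision handling) are assumptions for illustration, not the paper's exact algorithm:

```python
# Illustrative single-hash duplicate filter for an RFID read stream.
import hashlib

M, WINDOW = 1024, 5.0   # cell count and duplicate window in seconds (assumed)
last_seen = [None] * M

def cell(tag):
    # Single hash function: one array lookup per read.
    return int.from_bytes(hashlib.sha1(tag.encode()).digest()[:8], "big") % M

def is_duplicate(tag, now):
    c = cell(tag)
    t = last_seen[c]
    last_seen[c] = now                        # refresh the timestamp
    return t is not None and now - t <= WINDOW
```

Two tags hashing to the same cell can cause a false positive, which is the usual Bloom-style trade-off for constant memory and a single hash evaluation per read.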


Entropy ◽  
2021 ◽  
Vol 23 (8) ◽  
pp. 1091
Author(s):  
Siqi Sun ◽  
Yining Qian ◽  
Ruoshi Zhang ◽  
Yanqi Wang ◽  
Xinran Li

With the development of information technology, sharing data from multiple sources without privacy disclosure has become a popular topic. Privacy-preserving record linkage (PPRL) can link data that truly match without disclosing personal information. In existing studies, PPRL techniques have mostly been studied for alphabetic languages, which differ greatly from the Chinese language environment. In this paper, Chinese characters (identification fields in record pairs) are encoded into strings composed of letters and numbers by using the SoundShape code, according to their shapes and pronunciations. Then, the SoundShape codes are encrypted with a Bloom filter, and the similarity of encrypted fields is calculated by Dice similarity. In this method, the false positive rate of the Bloom filter and different proportions of sound code and shape code are considered. Finally, we applied the above methods to synthetic datasets and compared the precision, recall, F1-score, and computational time for different values of the false positive rate and proportion. The results showed that our method for PPRL in the Chinese language environment improved the quality of the classification results and outperformed others at a relatively low additional computational cost.
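The Dice similarity used to compare two Bloom-filter-encrypted fields is a direct computation on the bit arrays: twice the number of commonly set bits divided by the total number of set bits in each. A minimal version, with bit arrays represented as lists of 0/1:

```python
# Dice similarity over two Bloom-filter bit arrays:
# dice(A, B) = 2 * |A AND B| / (|A| + |B|), where |.| counts set bits.
def dice(a, b):
    inter = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 2 * inter / total if total else 0.0
```

Because the comparison never decodes the underlying strings, two parties can estimate field similarity while exchanging only the encrypted bit arrays.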


2016 ◽  
Vol 17 (2) ◽  
pp. 31-46 ◽  
Author(s):  
Khan Bahadar Khan ◽  
Amir A Khaliq ◽  
Muhammad Shahid ◽  
Sheroz Khan

Retinal damage caused by complications of diabetes is known as Diabetic Retinopathy (DR). In this condition, vision is obscured by damage to the tiny blood vessels of the retina, which may leak, impair vision, and lead to complete blindness. Identification of these new retinal vessels and their structure is essential for the analysis of DR. Automatic blood vessel segmentation plays a significant role in assisting subsequent automatic methodologies that aid such analysis. In the literature, most approaches rely on computationally expensive preprocessing followed by simple thresholding and post-processing. In contrast, our proposed technique uses an arrangement of light preprocessing: Contrast Limited Adaptive Histogram Equalization (CLAHE) for contrast enhancement; a difference image of the green channel and its Gaussian-blurred version to remove local noise and geometrical objects; a Modified Iterative Self-Organizing Data Analysis Technique (MISODATA) to segment vessel and non-vessel pixels based on global and local thresholding; and strong post-processing using region properties (area, eccentricity) to eliminate unwanted regions, non-vessel pixels, and noise, thereby rejecting misclassified foreground pixels. The strategy is tested on the publicly available DRIVE (Digital Retinal Images for Vessel Extraction) and STARE (STructured Analysis of the REtina) databases. The performance of the proposed technique is assessed comprehensively; its accuracy, robustness, low complexity, high efficiency, and very short computational time make the method an efficient tool for automatic retinal image analysis.
The proposed technique performs well compared with existing strategies on these publicly available databases in terms of accuracy, sensitivity, specificity, false positive rate, true positive rate, and area under the receiver operating characteristic (ROC) curve.
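The classic iterative (ISODATA, or Ridler-Calvard) threshold that the modified variant builds on is short enough to show in full: the threshold is repeatedly set to the midpoint of the two class means until it converges. This is the standard textbook algorithm, not the paper's modified version:

```python
# Classic ISODATA (iterative intermeans) thresholding on a flat list of
# pixel intensities; the modified variant in the paper builds on this.
def isodata_threshold(pixels, eps=0.5):
    t = sum(pixels) / len(pixels)          # start at the global mean
    while True:
        low = [p for p in pixels if p <= t]
        high = [p for p in pixels if p > t]
        if not low or not high:
            return t                       # degenerate: one class is empty
        new_t = 0.5 * (sum(low) / len(low) + sum(high) / len(high))
        if abs(new_t - t) < eps:
            return new_t
        t = new_t
```

On a bimodal intensity distribution, the threshold settles between the two modes, which is what separates vessel from non-vessel pixels after the contrast-enhancement steps above.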


2012 ◽  
Vol 6-7 ◽  
pp. 790-795
Author(s):  
Teng Fei Guo ◽  
Jian Biao Mao ◽  
Zhi Gang Sun

A Bloom filter is a space-efficient data structure with a certain probability of false positives. We present a reusable hardware implementation framework, define a module interface that provides users with a customizable module, and introduce hardware resource constraints into the false positive rate analysis, in contrast to traditional Bloom filter hardware designs and false positive analyses. Finally, we verify and analyze our design on the NetMagic platform.
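The false positive rate being analyzed follows the standard Bloom filter estimate, which is also what one evaluates when sizing a hardware table against fixed memory resources: for m bits, k hash functions, and n inserted items, p ≈ (1 − e^(−kn/m))^k. A one-line computation of it:

```python
# Standard Bloom-filter false-positive estimate: p = (1 - e^(-k*n/m))^k.
import math

def false_positive_rate(m, k, n):
    """m bits, k hash functions, n inserted items."""
    return (1.0 - math.exp(-k * n / m)) ** k
```

Under a hardware resource constraint, m is fixed by the available block RAM, so the designer tunes k and the supported n against this curve.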


2019 ◽  
Vol 5 (suppl) ◽  
pp. 44-44 ◽  
Author(s):  
Geoffrey R. Oxnard ◽  
Eric A. Klein ◽  
Michael Seiden ◽  
Earl Hubbell ◽  
Oliver Venn ◽  
...  

44 Background: A noninvasive cfDNA blood test detecting multiple cancers at earlier stages could decrease cancer mortality. In earlier discovery work, whole-genome bisulfite sequencing outperformed whole-genome and targeted sequencing approaches for multi-cancer detection across stages at high specificity. Here, multi-cancer detection and tissue-of-origin (TOO) localization using bisulfite sequencing of plasma cfDNA to identify methylomic signatures was evaluated in preparation for clinical validation, utility, and implementation studies. Methods: 2301 analyzable participants (1422 cancer [ > 20 tumor types, all stages], 879 non-cancer) were included in this prespecified substudy of the Circulating Cell-free Genome Atlas (CCGA) (NCT02889978) study, a prospective, multi-center, observational, case-control study with longitudinal follow-up. Plasma cfDNA was subjected to a targeted methylation sequencing assay using high-efficiency methylation chemistry to enrich for methylation targets, and a machine learning classifier determined cancer status and TOO. Observed methylation fragments characteristic of cancer and TOO were combined across targeted regions and assigned a relative probability of cancer and of a specific TOO. Results: Performance is reported at 99% specificity (i.e., a combined false positive rate across all cancer types of 1%), a level required for population-level screening. Across cancer types, sensitivity ranged from 59% to 86%. Combined cancer detection (sensitivity [95% CI]) was 34% (27-43%) in stage I (n = 151), 77% (70-83%) in stage II (n = 171), 84% (79-89%) in stage III (n = 204), and 92% (88-95%) in stage IV (n = 281). TOO was provided for 94% of all cancers detected; of these, TOO was correct in > 90% of cases. Conclusions: Detection of multiple deadly cancers across stages using methylation signatures in plasma cfDNA was achieved with a single, fixed, low false positive rate, and simultaneously provided accurate TOO localization.
This targeted methylation assay is undergoing validation in preparation for prospective clinical investigation as a cancer detection diagnostic. Clinical trial information: NCT02889978.

