Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling

2021 ◽  
Vol 17 (4) ◽  
pp. 1-23
Author(s):  
Datong Zhang ◽  
Yuhui Deng ◽  
Yi Zhou ◽  
Yifeng Zhu ◽  
Xiao Qin

Data deduplication techniques construct an index consisting of fingerprint entries to identify and eliminate duplicated copies of repeating data. The bottleneck of disk-based index lookup and the data fragmentation caused by eliminating duplicated chunks are two challenging issues in data deduplication. Deduplication-based backup systems generally employ containers, which store contiguous chunks together with their fingerprints, to preserve data locality and alleviate these two issues; this, however, remains inadequate. To address these two issues, we propose a container-utilization-based hot fingerprint entry distilling strategy to improve the performance of deduplication-based backup systems. We divide the index into three parts: hot fingerprint entries, fragmented fingerprint entries, and useless fingerprint entries. A container whose utilization is smaller than a given threshold is called a sparse container. Fingerprint entries that point to non-sparse containers are hot fingerprint entries. Among the remaining fingerprint entries, an entry that matches a fingerprint of a forthcoming backup chunk is classified as a fragmented fingerprint entry; otherwise, it is classified as a useless fingerprint entry. We observe that hot fingerprint entries account for a small part of the index, whereas the remaining fingerprint entries account for the majority of it. This intriguing observation inspires us to develop a hot fingerprint entry distilling approach named HID. HID segregates useless fingerprint entries from the index to improve memory utilization and bypass disk accesses. In addition, HID separates fragmented fingerprint entries so that a deduplication-based backup system directly rewrites fragmented chunks, thereby alleviating adverse fragmentation. Moreover, HID introduces a feature that treats fragmented chunks as unique chunks.
This feature compensates for the shortcoming that a Bloom filter cannot directly identify certain duplicated chunks (i.e., the fragmented chunks). To take full advantage of this feature, we propose an evolved HID strategy called EHID. EHID incorporates a Bloom filter to which only hot fingerprints are mapped. In doing so, EHID exhibits two salient features: (i) EHID avoids disk accesses when identifying unique chunks and fragmented chunks; (ii) EHID slashes the false positive rate of the integrated Bloom filter. These salient features push EHID into the high-efficiency mode. Our experimental results show that our approach reduces the average memory overhead of the index by 34.11% and 25.13% when using the Linux dataset and the FSL dataset, respectively. Furthermore, compared with the state-of-the-art method HAR, EHID boosts the average backup throughput by up to a factor of 2.25 on the Linux dataset, and EHID reduces the average disk I/O traffic by up to 66.21% on the FSL dataset. EHID also marginally improves the system's restore performance.
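The three-way split of the index can be illustrated in a few lines. This is a sketch under assumed inputs (a fingerprint-to-container map, per-container utilization values, and a set of forthcoming fingerprints), not the authors' implementation; the threshold value is arbitrary:

```python
# Illustrative sketch: classify fingerprint entries as hot, fragmented,
# or useless, following the paper's three-way index split.

THRESHOLD = 0.5  # containers below this utilization are "sparse" (assumed value)

def classify_entries(index, utilization, forthcoming):
    """index: fingerprint -> container id; utilization: container id -> float
    in [0, 1]; forthcoming: set of fingerprints of upcoming backup chunks."""
    hot, fragmented, useless = {}, {}, {}
    for fp, cid in index.items():
        if utilization[cid] >= THRESHOLD:
            hot[fp] = cid            # points to a non-sparse container
        elif fp in forthcoming:
            fragmented[fp] = cid     # sparse container, but still referenced
        else:
            useless[fp] = cid        # sparse container, never referenced again
    return hot, fragmented, useless
```

Only the hot part would then be kept in the fast path (and, in EHID, mapped into the Bloom filter), while useless entries are dropped and fragmented entries trigger rewrites.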

2015 ◽  
Vol 2015 ◽  
pp. 1-12
Author(s):  
Siyu Lin ◽  
Hao Wu

Cyber-physical systems (CPSs) connect with the physical world via communication networks, which significantly increases their security risks. To secure sensitive data, secure forwarding is an essential component of CPSs. However, CPSs have high-dimensional, multiattribute, and multilevel security requirements due to their significantly increased scale and diversity, which imposes high demands on the query and storage of secure forwarding information. To tackle these challenges, we propose a practical secure data forwarding scheme for CPSs. Considering the limited storage capability and computational power of entities, we adopt a Bloom filter to store the secure forwarding information for each entity, which achieves a good balance between storage consumption and query delay. Furthermore, a novel link-based Bloom filter construction method is designed to reduce the false positive rate during Bloom filter construction. Finally, the effects of the false positive rate on the performance of Bloom filter-based secure forwarding under different routing policies are discussed.
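The general idea of storing forwarding information in a Bloom filter can be sketched as follows. This is a minimal illustration of a standard Bloom filter, not the paper's link-based construction; the key format, sizes, and hash choice are assumptions:

```python
# Minimal Bloom-filter sketch for secure-forwarding lookups (illustrative).
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):
        # "No" is definitive; "yes" may be a false positive.
        return all(self.bits[p] for p in self._positions(item))

# An entity could insert, e.g., "next_hop|security_level" strings:
bf = BloomFilter()
bf.add("nodeB|level3")
```

A membership query then replaces a lookup in a full forwarding table, trading a small false positive rate for constant-size storage.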


2010 ◽  
Vol 110 (21) ◽  
pp. 944-949 ◽  
Author(s):  
Ken Christensen ◽  
Allen Roginsky ◽  
Miguel Jimeno

2019 ◽  
Vol 35 (23) ◽  
pp. 4871-4878
Author(s):  
Peng Jiang ◽  
Jie Luo ◽  
Yiqi Wang ◽  
Pingji Deng ◽  
Bertil Schmidt ◽  
...  

Abstract Motivation K-mers along with their frequency have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters is itself large; very often, it is too large to fit into main memory, greatly narrowing its usability. Results We introduce a novel idea of encoding k-mers as well as their frequency, achieving good memory saving and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure to encode counted k-mers by coupled-bit arrays—one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that the average memory-saving ratio on all 31-mers is as high as 13.81 as compared with raw input, with 7 hash functions. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. Availability and implementation The source code of our algorithm is available at github.com/lzhLab/kmcEx. Supplementary information Supplementary data are available at Bioinformatics online.
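A toy version of the coupled-bit-arrays idea can be written down directly: one array records k-mer membership, and a parallel bank of arrays encodes each bit of the k-mer's count at the same hash positions. This is a simplified sketch in the spirit of the description above, not the published kmcEx data structure; all sizes are assumptions:

```python
# Toy coupled-bit-arrays sketch: membership array + per-bit frequency arrays.
import hashlib

M, K, FREQ_BITS = 4096, 7, 8  # array size, hash count, count bits (assumed)

member = bytearray(M)
freq = [bytearray(M) for _ in range(FREQ_BITS)]

def positions(kmer):
    # K positions per k-mer, derived from a salted hash.
    for i in range(K):
        h = hashlib.md5(f"{i}:{kmer}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M

def insert(kmer, count):
    for p in positions(kmer):
        member[p] = 1
        for b in range(FREQ_BITS):
            if (count >> b) & 1:
                freq[b][p] = 1    # set this count bit at the same position

def lookup(kmer):
    pos = list(positions(kmer))
    if not all(member[p] for p in pos):
        return None               # definitely absent
    # A count bit reads 1 only if it is set at all K positions.
    return sum(1 << b for b in range(FREQ_BITS)
               if all(freq[b][p] for p in pos))
```

As with a Bloom filter, collisions can only inflate a reported count, never report an inserted k-mer as absent.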


2018 ◽  
Vol 7 (2.8) ◽  
pp. 13
Author(s):  
B Tirapathi Reddy ◽  
M V. P. Chandra Sekhara Rao

Storing data in the cloud has become a necessity as users accumulate abundant data every day and run out of physical storage devices. However, the majority of the data in cloud storage is redundant. Data deduplication using convergent key encryption has been the mechanism popularly used to eliminate redundant data items in cloud storage. Convergent key encryption suffers from various drawbacks. For instance, if data items are deduplicated based on a convergent key, any unauthorized user can compromise the cloud storage simply by possessing a guessed hash of the file. So, ensuring the ownership of data items is essential to protect them. As the cuckoo filter offers a minimal false positive rate with minimal space overhead, our mechanism provides proof of ownership based on it.
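For reference, the cuckoo filter underlying such a scheme works by storing short fingerprints in one of two candidate buckets linked by an XOR of the fingerprint's hash ("partial-key cuckoo hashing"). The sketch below shows only the filter itself, with assumed sizes; a real proof-of-ownership protocol would layer challenges on top of it:

```python
# Minimal cuckoo-filter sketch (partial-key cuckoo hashing), illustrative only.
import hashlib
import random

BUCKETS, SLOTS, MAX_KICKS = 256, 4, 64   # power-of-two bucket count (assumed)
table = [[] for _ in range(BUCKETS)]

def _h(data):
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def fingerprint(item):
    return (_h(b"fp" + item) % 255) + 1   # 1..255; 0 reserved for "empty"

def insert(item):
    f = fingerprint(item)
    i1 = _h(item) % BUCKETS
    i2 = (i1 ^ _h(bytes([f]))) % BUCKETS  # partner bucket via the fingerprint
    for i in (i1, i2):
        if len(table[i]) < SLOTS:
            table[i].append(f)
            return True
    i = random.choice((i1, i2))
    for _ in range(MAX_KICKS):            # evict entries until a slot frees up
        j = random.randrange(len(table[i]))
        f, table[i][j] = table[i][j], f
        i = (i ^ _h(bytes([f]))) % BUCKETS
        if len(table[i]) < SLOTS:
            table[i].append(f)
            return True
    return False                          # filter considered full

def contains(item):
    f = fingerprint(item)
    i1 = _h(item) % BUCKETS
    i2 = (i1 ^ _h(bytes([f]))) % BUCKETS
    return f in table[i1] or f in table[i2]
```

Because BUCKETS is a power of two, the XOR relation is an involution, so an evicted fingerprint can always find its partner bucket without knowing the original item.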


2016 ◽  
Vol 2016 ◽  
pp. 1-7 ◽  
Author(s):  
Hazalila Kamaludin ◽  
Hairulnizam Mahdin ◽  
Jemal H. Abawajy

Radio Frequency Identification (RFID)-enabled systems are evolving in many applications, such as supply chain management, that need to know the physical location of objects. Naturally, RFID systems create large volumes of duplicate data. As the duplicate data wastes communication, processing, and storage resources and delays decision-making, filtering duplicate data from RFID data streams is an important and challenging problem. Existing Bloom Filter-based approaches for filtering duplicate RFID data streams are complex and slow, as they use multiple hash functions. In this paper, we propose an approach for filtering duplicate data from RFID data streams. The proposed approach is based on a modified Bloom Filter and uses only a single hash function. We performed an extensive empirical study of the proposed approach and compared it against the Bloom Filter, d-Left Time Bloom Filter, and Count Bloom Filter approaches. The results show that the proposed approach outperforms the baseline approaches in terms of false positive rate, execution time, and true positive rate.
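A single-hash time-window filter of this general flavor is easy to sketch: one hash maps each tag ID to a cell holding its last-seen time, and a read arriving within a window of that time is treated as a duplicate. The details below (cell count, window length, collision handling) are assumptions for illustration, not the paper's exact algorithm:

```python
# Illustrative single-hash duplicate filter for an RFID read stream.
import hashlib

M, WINDOW = 1024, 5.0   # cell count and duplicate window in seconds (assumed)
last_seen = [None] * M

def cell(tag):
    # Single hash function: one array lookup per read.
    return int.from_bytes(hashlib.sha1(tag.encode()).digest()[:8], "big") % M

def is_duplicate(tag, now):
    c = cell(tag)
    t = last_seen[c]
    last_seen[c] = now                        # refresh the timestamp
    return t is not None and now - t <= WINDOW
```

Two tags hashing to the same cell can cause a false positive, which is the usual Bloom-style trade-off for constant memory and a single hash evaluation per read.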


Entropy ◽  
2021 ◽  
Vol 23 (8) ◽  
pp. 1091
Author(s):  
Siqi Sun ◽  
Yining Qian ◽  
Ruoshi Zhang ◽  
Yanqi Wang ◽  
Xinran Li

With the development of information technology, sharing data from multiple sources without privacy disclosure has become a popular topic. Privacy-preserving record linkage (PPRL) can link data that truly match without disclosing personal information. In existing studies, PPRL techniques have mostly been studied for alphabetic languages, which differ greatly from the Chinese language environment. In this paper, Chinese characters (identification fields in record pairs) are encoded into strings composed of letters and numbers by using the SoundShape code, according to their shapes and pronunciations. Then, the SoundShape codes are encrypted with a Bloom filter, and the similarity of encrypted fields is calculated by Dice similarity. In this method, the false positive rate of the Bloom filter and different proportions of sound code and shape code are considered. Finally, we applied the above methods to synthetic datasets and compared the precision, recall, F1-score, and computational time for different values of the false positive rate and proportion. The results showed that our method for PPRL in the Chinese language environment improved the quality of the classification results and outperformed others at a relatively low additional computational cost.
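The Dice similarity used to compare two Bloom-filter-encrypted fields is a direct computation on the bit arrays: twice the number of commonly set bits divided by the total number of set bits in each. A minimal version, with bit arrays represented as lists of 0/1:

```python
# Dice similarity over two Bloom-filter bit arrays:
# dice(A, B) = 2 * |A AND B| / (|A| + |B|), where |.| counts set bits.
def dice(a, b):
    inter = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 2 * inter / total if total else 0.0
```

Because the comparison never decodes the underlying strings, two parties can estimate field similarity while exchanging only the encrypted bit arrays.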


2016 ◽  
Vol 17 (2) ◽  
pp. 31-46 ◽  
Author(s):  
Khan Bahadar Khan ◽  
Amir A Khaliq ◽  
Muhammad Shahid ◽  
Sheroz Khan

Retinal damage caused by complications of diabetes is known as Diabetic Retinopathy (DR). In this condition, vision is obscured by damage to the tiny blood vessels of the retina, which may leak, impair vision, and lead to complete blindness. Identification of these new retinal vessels and their structure is essential for the analysis of DR. Automatic blood vessel segmentation plays a significant role in assisting subsequent automatic methodologies that aid such analysis. In the literature, most approaches rely on computationally expensive preprocessing followed by simple thresholding and post-processing. In contrast, our proposed technique uses an arrangement of light preprocessing: Contrast Limited Adaptive Histogram Equalization (CLAHE) for contrast enhancement; a difference image of the green channel and its Gaussian-blurred version to remove local noise and geometrical objects; a Modified Iterative Self-Organizing Data Analysis Technique (MISODATA) to segment vessel and non-vessel pixels based on global and local thresholding; and strong post-processing using region properties (area, eccentricity) to eliminate unwanted regions, non-vessel pixels, and noise, thereby rejecting misclassified foreground pixels. The strategy is tested on the publicly available DRIVE (Digital Retinal Images for Vessel Extraction) and STARE (STructured Analysis of the REtina) databases. The performance of the proposed technique is assessed comprehensively; its accuracy, robustness, low complexity, high efficiency, and very short computational time make the method an efficient tool for automatic retinal image analysis.
The proposed technique performs well compared with existing strategies on these publicly available databases in terms of accuracy, sensitivity, specificity, false positive rate, true positive rate, and area under the receiver operating characteristic (ROC) curve.
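The classic iterative (ISODATA, or Ridler-Calvard) threshold that the modified variant builds on is short enough to show in full: the threshold is repeatedly set to the midpoint of the two class means until it converges. This is the standard textbook algorithm, not the paper's modified version:

```python
# Classic ISODATA (iterative intermeans) thresholding on a flat list of
# pixel intensities; the modified variant in the paper builds on this.
def isodata_threshold(pixels, eps=0.5):
    t = sum(pixels) / len(pixels)          # start at the global mean
    while True:
        low = [p for p in pixels if p <= t]
        high = [p for p in pixels if p > t]
        if not low or not high:
            return t                       # degenerate: one class is empty
        new_t = 0.5 * (sum(low) / len(low) + sum(high) / len(high))
        if abs(new_t - t) < eps:
            return new_t
        t = new_t
```

On a bimodal intensity distribution, the threshold settles between the two modes, which is what separates vessel from non-vessel pixels after the contrast-enhancement steps above.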


2012 ◽  
Vol 6-7 ◽  
pp. 790-795
Author(s):  
Teng Fei Guo ◽  
Jian Biao Mao ◽  
Zhi Gang Sun

A Bloom filter is a space-efficient data structure with a certain probability of false positives. We present a reusable hardware implementation framework, define a module interface that provides users with a customizable module, and introduce hardware resource constraints into the false positive rate analysis, in contrast to traditional Bloom filter hardware designs and false positive analyses. Finally, we verify and analyze our design on the NetMagic platform.
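The false positive rate being analyzed follows the standard Bloom filter estimate, which is also what one evaluates when sizing a hardware table against fixed memory resources: for m bits, k hash functions, and n inserted items, p ≈ (1 − e^(−kn/m))^k. A one-line computation of it:

```python
# Standard Bloom-filter false-positive estimate: p = (1 - e^(-k*n/m))^k.
import math

def false_positive_rate(m, k, n):
    """m bits, k hash functions, n inserted items."""
    return (1.0 - math.exp(-k * n / m)) ** k
```

Under a hardware resource constraint, m is fixed by the available block RAM, so the designer tunes k and the supported n against this curve.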


2019 ◽  
Vol 5 (suppl) ◽  
pp. 44-44 ◽  
Author(s):  
Geoffrey R. Oxnard ◽  
Eric A. Klein ◽  
Michael Seiden ◽  
Earl Hubbell ◽  
Oliver Venn ◽  
...  

44 Background: A noninvasive cfDNA blood test detecting multiple cancers at earlier stages could decrease cancer mortality. In earlier discovery work, whole-genome bisulfite sequencing outperformed whole-genome and targeted sequencing approaches for multi-cancer detection across stages at high specificity. Here, multi-cancer detection and tissue-of-origin (TOO) localization using bisulfite sequencing of plasma cfDNA to identify methylomic signatures was evaluated in preparation for clinical validation, utility, and implementation studies. Methods: 2301 analyzable participants (1422 cancer [ > 20 tumor types, all stages], 879 non-cancer) were included in this prespecified substudy of the Circulating Cell-free Genome Atlas (CCGA) (NCT02889978) study, a prospective, multi-center, observational, case-control study with longitudinal follow-up. Plasma cfDNA was subjected to a targeted methylation sequencing assay using high-efficiency methylation chemistry to enrich for methylation targets, and a machine learning classifier determined cancer status and TOO. Observed methylation fragments characteristic of cancer and TOO were combined across targeted regions and assigned a relative probability of cancer and of a specific TOO. Results: Performance is reported at 99% specificity (i.e., a combined false positive rate across all cancer types of 1%), a level required for population-level screening. Across cancer types, sensitivity ranged from 59% to 86%. Combined cancer detection (sensitivity [95% CI]) was 34% (27-43%) in stage I (n = 151), 77% (70-83%) in stage II (n = 171), 84% (79-89%) in stage III (n = 204), and 92% (88-95%) in stage IV (n = 281). TOO was provided for 94% of all cancers detected; of these, TOO was correct in > 90% of cases. Conclusions: Detection of multiple deadly cancers across stages using methylation signatures in plasma cfDNA was achieved with a single, fixed, low false positive rate, and simultaneously provided accurate TOO localization.
This targeted methylation assay is undergoing validation in preparation for prospective clinical investigation as a cancer detection diagnostic. Clinical trial information: NCT02889978.

