Data Deduplication with Encrypted Parameters using Big Data and Cloud

Cloud computing enables organizations to consume a computing resource, such as a virtual machine (VM), storage, or an application, as a utility, much like electricity, rather than building and maintaining computing infrastructure in house. In cloud computing the most important component is the data center, where users' data is stored. In data centers, the same data may be uploaded multiple times and data can be hacked; therefore, when using cloud services, data should be encrypted before it is stored. With the continuous and exponential increase in the number of users and the size of their data, data deduplication becomes more and more of a necessity for cloud storage providers. By storing a single unique copy of duplicate data, cloud providers greatly reduce their storage and data-transfer costs. The encrypted data can also be securely accessed by authorized data holders who receive the symmetric keys used for decryption. The results demonstrate the superior efficiency and effectiveness of the scheme for big data deduplication in cloud storage. We evaluate its performance through extensive analysis and computer simulations, with the help of logs captured at the time of deduplication.
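A standard way to reconcile encryption with deduplication, in the spirit of the scheme described above, is convergent (message-locked) encryption, where the symmetric key is derived from the content itself so that identical files produce identical ciphertexts the provider can deduplicate. The following Python sketch is a minimal illustration of that idea under stated assumptions (it uses the third-party cryptography package and hypothetical function names), not the paper's actual construction.

# Minimal sketch of convergent (message-locked) encryption for deduplication;
# function names and the nonce-derivation rule are illustrative assumptions.
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def convergent_encrypt(plaintext: bytes) -> tuple[bytes, bytes]:
    """Encrypt data under a key derived from its own content.

    Identical plaintexts yield identical keys and ciphertexts, so the
    storage provider can detect duplicates without reading the plaintext.
    """
    key = hashlib.sha256(plaintext).digest()                    # content-derived AES-256 key
    nonce = hashlib.sha256(b"nonce" + plaintext).digest()[:12]  # deterministic nonce (needed for dedup)
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return key, nonce + ciphertext       # the key is shared only with authorized data holders

def convergent_decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)

if __name__ == "__main__":
    k1, c1 = convergent_encrypt(b"same file contents")
    k2, c2 = convergent_encrypt(b"same file contents")
    assert c1 == c2                                             # duplicate detectable from ciphertext alone
    assert convergent_decrypt(k1, c1) == b"same file contents"

Because the nonce is deterministic, equal files always encrypt to equal blobs; that is what makes server-side deduplication possible, at the cost of revealing that two users hold the same file.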

Author(s):  
MD. Jareena Begum ◽  
B. Haritha

Cloud computing plays an essential role in the business platform, as computing resources are delivered on demand to users over the Internet. Cloud computing provides on-demand and ubiquitous access to a centralized pool of configurable resources such as networks, applications, and services. This ensures that most enterprises and a large number of users externalize their data to the cloud server. Recently, secure deduplication techniques have attracted considerable interest in both academic and industrial communities. The main advantage of using cloud storage, from the users' point of view, is that they can reduce their expenditure on purchasing and maintaining storage infrastructure. With the growing data size of cloud computing, a reduction in data volume could help providers cut the cost of running large storage systems and save power. Therefore, data deduplication techniques have been proposed to improve storage efficiency in cloud storage. In addition, to protect sensitive files, users often apply encryption algorithms to their files before storing them in cloud storage. In this paper we propose strategies for secure data deduplication.


Data de-duplication is one of the most important data-compression techniques for eliminating duplicate copies of repeated data in cloud storage, used to reduce the amount of storage space and save transmission bandwidth. Data compression achieves a logical reduction of storage space by means of minimum hashing. To protect the confidentiality of sensitive data while supporting de-duplication, the convergent encryption technique has been proposed to encrypt the data before outsourcing. To better protect data security, this work attempts to formally address the problem of authorized data de-duplication. Different from traditional de-duplication, users benefit from improved storage capacity and security analysis. It also presents several new de-duplication constructions supporting authorized duplicate check in a hybrid cloud architecture. Security analysis shows that the schemes are designed to prevent unauthorized access. As a proof of concept, we implement a prototype of our proposed authorized duplicate-check scheme and conduct testbed experiments using the prototype. We show that the proposed authorized duplicate check incurs minimal overhead compared with normal operations. De-duplication has been shown to achieve high space and cost savings, and many cloud storage providers are now adopting it. De-duplication can reduce storage needs by up to 90-95 percent for backup workloads.
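As a hedged illustration of the authorized duplicate check described above (a sketch under assumptions, not the paper's construction), the Python example below binds a duplicate-check token to both the file fingerprint and a per-privilege secret key, so only a user holding the matching privilege key can trigger a duplicate match; the privilege keys and function names are hypothetical.

import hashlib
import hmac

def file_tag(data: bytes) -> bytes:
    """Fingerprint of the file contents."""
    return hashlib.sha256(data).digest()

def duplicate_check_token(privilege_key: bytes, tag: bytes) -> str:
    """Token bound to both the file and the user's privilege; without the
    matching privilege key, a user cannot produce a matching token."""
    return hmac.new(privilege_key, tag, hashlib.sha256).hexdigest()

# The public cloud keeps an index of tokens for files it already stores.
stored_tokens: set[str] = set()

def upload(data: bytes, privilege_key: bytes) -> str:
    token = duplicate_check_token(privilege_key, file_tag(data))
    if token in stored_tokens:
        return "duplicate: store a reference only"
    stored_tokens.add(token)
    return "new file: store the ciphertext"

if __name__ == "__main__":
    hr_key = b"hypothetical-privilege-key-for-HR"   # illustrative privilege key
    print(upload(b"quarterly report", hr_key))      # new file
    print(upload(b"quarterly report", hr_key))      # duplicate detected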


2018 ◽  
Vol 7 (2.4) ◽  
pp. 46 ◽  
Author(s):  
Shubhanshi Singhal ◽  
Akanksha Kaushik ◽  
Pooja Sharma

Due to the drastic growth of digital data, data deduplication has become a standard component of modern backup systems. It reduces data redundancy, saves storage space, and simplifies the management of data chunks. The process is performed in three steps: chunking, fingerprinting, and indexing of fingerprints. In chunking, data files are divided into chunks and the chunk boundary is decided by the value of the divisor. For each chunk, a unique identifying value, known as a fingerprint, is computed using a hash function (e.g., MD5, SHA-1, SHA-256). Finally, these fingerprints are stored in the index to detect redundant chunks, i.e., chunks having the same fingerprint values. In chunking, the chunk size is an important factor that should be optimal for good deduplication performance. The genetic algorithm (GA) is gaining popularity and can be applied to find the best value of the divisor. Secondly, indexing also enhances the performance of the system by reducing the search time. Binary search tree (BST) based indexing has a time complexity of O(log n), which is minimal among searching algorithms. A new model is proposed that uses a GA to find the value of the divisor; it is the first attempt to apply a GA in the field of data deduplication. The second improvement in the proposed system is that a BST index tree is used to index the fingerprints. The performance of the proposed system is evaluated on VMDK, Linux, and Quanto datasets, and a good improvement is achieved in deduplication ratio.
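The three steps above (divisor-driven chunking, fingerprinting each chunk with a hash, and indexing the fingerprints) can be illustrated with the simplified Python sketch below. The toy rolling hash, the parameter values, and the plain dictionary standing in for the proposed BST index are all assumptions for illustration; they do not reflect the GA-tuned divisor or the actual index structure evaluated in the paper.

# Simplified, illustrative chunk/fingerprint/index pipeline; all parameters are toy values.
import hashlib

def chunk(data: bytes, divisor: int = 64, min_size: int = 32, max_size: int = 4096) -> list[bytes]:
    """Content-defined chunking: cut a chunk when the rolling hash modulo
    the divisor hits a fixed target, or when the chunk reaches max_size."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = (rolling * 31 + byte) & 0xFFFFFFFF          # toy rolling hash
        size = i - start + 1
        at_boundary = size >= min_size and rolling % divisor == divisor - 1
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                           # trailing chunk
    return chunks

def deduplicate(data: bytes, index: dict[str, bytes]) -> int:
    """Store only chunks whose fingerprint is not yet indexed; return the
    number of duplicate chunks that were skipped."""
    duplicates = 0
    for c in chunk(data):
        fingerprint = hashlib.sha256(c).hexdigest()           # chunk fingerprint
        if fingerprint in index:
            duplicates += 1
        else:
            index[fingerprint] = c
    return duplicates

if __name__ == "__main__":
    index: dict[str, bytes] = {}
    deduplicate(b"abcdefgh" * 500, index)                     # first copy: chunks stored
    print(deduplicate(b"abcdefgh" * 500, index))              # second copy: every chunk is a duplicate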


Author(s):  
Victor Olago ◽  
Lina Bartels ◽  
Tafadzwa Dhokotera ◽  
Julia Bohlius ◽  
...  

Introduction: The South African HIV Cancer Match (SAM) study is a probabilistic record linkage study involving the creation of an HIV cohort from laboratory records of the National Health Laboratory Service (NHLS). This cohort was linked to the pathology-based South African National Cancer Registry to establish cancer incidence among the HIV-positive population in South Africa. As the number of HIV records increases, there is a need for more efficient ways of de-duplicating this big data. In this work, we used clustering to perform big-data deduplication.

Objectives and Approach: Our objective was to use DBSCAN as the clustering algorithm, together with a bi-gram word analyser, to perform big-data deduplication in resource-limited settings. We used HIV-related laboratory records from across South Africa collated in the NHLS Corporate Data Warehouse for the period 2004-2014. This involved data pre-processing, deterministic deduplication, n-gram generation, feature generation using a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, clustering using DBSCAN, and assigning cluster labels to records that potentially belonged to the same person. We used records with national identification numbers to assess the quality of deduplication by calculating precision, recall, and f-measure.

Results: We had 51,563,127 HIV-related laboratory records. Deterministic deduplication resulted in 20,387,819 deduplicated patient records. With DBSCAN clustering we further reduced this to 14,849,524 patient record clusters. In this final dataset, 3,355,544 (22.60%) patients had a negative HIV test, 11,316,937 (76.21%) had evidence of HIV infection, and for 177,043 (1.19%) the HIV status could not be determined. The precision, recall, and f-measure, based on 1,865,445 records with national identification numbers, were 0.96, 0.94, and 0.95, respectively.

Conclusion / Implications: Our study demonstrated that DBSCAN clustering is an effective way of deduplicating big datasets in resource-limited settings. This enabled refining of an HIV observational database by accurately linking test records that potentially belonged to the same person. The methodology creates opportunities for easy data profiling to inform public health decision making.
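As an illustrative sketch of the clustering step (toy records and assumed parameters, not the study's actual data, fields, or settings), the Python example below vectorizes a few patient-record strings with character-bigram TF-IDF features and groups near-duplicates with DBSCAN using scikit-learn.

from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical records: name, date of birth, sex. Rows 0-2 are variants of
# the same person; row 3 is a different person.
records = [
    "MOKOENA THABO 1985-03-12 M",
    "MOKOENA  THABO 1985-03-12 M",     # extra-whitespace variant
    "MOKENA THABO 1985-03-12 M",       # spelling variant
    "DLAMINI ZANELE 1990-07-01 F",
]

# Character bigrams tolerate small spelling differences between records.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 2))
features = vectorizer.fit_transform(records)

# Cosine distance with min_samples=1 so isolated records keep their own cluster;
# eps=0.3 is an assumed threshold, not the study's value.
labels = DBSCAN(eps=0.3, min_samples=1, metric="cosine").fit_predict(features)

for record, label in zip(records, labels):
    print(label, record)               # records sharing a label are treated as the same person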

