Sandbox security model for Hadoop file system

2020
Vol 7 (1)
Author(s):  
Gousiya Begum
S. Zahoor Ul Huq
A. P. Siva Kumar

Extensive usage of Internet-based applications in day-to-day life has led to the generation of huge amounts of data every minute. Apart from humans, data is also generated by machines such as sensors, satellites, CCTV, etc. This huge collection of heterogeneous data is often referred to as Big Data, which can be processed to draw useful insights. Apache Hadoop has emerged as a widely used open source software framework for Big Data processing; it is a cluster of cooperating computers enabling distributed parallel processing. The Hadoop Distributed File System (HDFS) is used to store data blocks, replicated and spanned across different nodes. HDFS applies AES-based cryptographic techniques at the block level that are transparent and end-to-end in nature. However, while cryptography provides security against unauthorized access to the data blocks, a legitimate user can still harm the data. One such example is the execution of malicious MapReduce JAR files by a legitimate user, which can damage the data in HDFS. We developed a mechanism in which every MapReduce JAR is tested by our sandbox security layer to ensure it is not malicious, and suspicious JAR files are not allowed to process the data in HDFS. This feature is not present in the existing Apache Hadoop framework, and our work has been made available on GitHub for consideration and inclusion in future versions of Apache Hadoop.
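The abstract does not include the screening code itself; as a rough illustration of how a pre-submission check might work, the Java sketch below statically scans a MapReduce JAR for references to APIs a vetting policy could forbid. The class name JarSandboxCheck, the DISALLOWED list, and the pass/fail policy are illustrative assumptions, not the authors' implementation:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarSandboxCheck {
    // Hypothetical policy: class-file references a vetted job must not make.
    private static final List<String> DISALLOWED = List.of(
            "java/lang/Runtime",         // arbitrary process execution
            "java/lang/ProcessBuilder"); // arbitrary process execution

    // Returns true if any class in the JAR references a disallowed API.
    public static boolean isSuspicious(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (!entry.getName().endsWith(".class")) continue;
                // Class references appear as plain strings in the constant pool,
                // so a byte-preserving text search is enough for a first gate.
                byte[] bytes = jar.getInputStream(entry).readAllBytes();
                String body = new String(bytes, StandardCharsets.ISO_8859_1);
                for (String banned : DISALLOWED) {
                    if (body.contains(banned)) return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (isSuspicious(args[0])) {
            System.err.println("Rejected: JAR references disallowed APIs");
            System.exit(1);
        }
        System.out.println("JAR passed screening; submission allowed");
    }
}

A real sandbox would go further, for example executing the job against scratch data under a restricted user, but a screening gate of this shape conveys the basic idea of vetting a JAR before it may touch HDFS.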

The Hadoop Distributed File System (HDFS) and MapReduce (MR) are the key components of the Hadoop framework. Big Data scenarios such as Facebook data processing, or Twitter analytics such as storing and processing tweets, depend on the Hadoop framework for storage and processing, on top of which further analytics can be performed. The point here is that processing such huge amounts of data inevitably leads to high space and time consumption in the Hadoop framework. The problem is twofold: large amounts of space are used and, at the same time, processing time is high; both must be reduced to obtain faster responses from the framework. The attempt is important because all the other ecosystem tools also depend on HDFS and MR for data storage and processing, so an alternative architecture is needed to improve space usage and utilize resources effectively, thereby reducing the framework's time requirements. The outcome of the work is faster data processing and lower space utilization of the framework when running MR along with other ecosystem tools such as Hive, Flume, Sqoop, and Pig Latin. The work proposes an alternative framework to HDFS and MR, which we name the Unified Space Allocation and Data Processing with Metadata based Distributed File System (USAMDFS).


The chapter explains that NoSQL databases emerged as an attempt to resolve the limitations of relational databases in coping with Big Data. The issue of Big Data stems from extensive requirements for the storage and management of complex, dynamic, evolving, distributed, and heterogeneous data from different sources and platforms. The chapter provides an overview of the technologies, including the Google File System (GFS), MapReduce, Hadoop, and the Hadoop Distributed File System (HDFS), which were the first responses to Big Data challenges and the main driving forces behind the development of NoSQL databases. The chapter also asserts that NoSQL is an umbrella term covering numerous databases with different architectures and purposes, which can be classified into four basic categories: key-value, column-family, document, and graph stores. The chapter discusses the general features of NoSQL databases, as well as the specific features of each of the four basic categories.


2016
pp. 1220-1243
Author(s):  
Ilias K. Savvas
Georgia N. Sofianidou
M-Tahar Kechadi

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets; HDFS is a distributed file system that provides high-throughput access for data-driven applications, and MapReduce is a software framework for distributed computing on large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular techniques of data mining is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique prove its efficiency; thus, HDFS and MapReduce can be applied to big data with very promising results.
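As a concrete reference point for how K-means maps onto this model, here is a minimal sketch of one iteration as a Hadoop MapReduce job; it follows assumed conventions (one comma-separated 2-D point per input line, current centroids passed via the job configuration) and is not the authors' code. The mapper assigns each point to its nearest centroid, the reducer recomputes centroids as cluster means, and the job is rerun until the centroids stabilize:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

    // Mapper: assign each point to its nearest centroid.
    public static class AssignMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        private double[][] centroids; // loaded once per task

        @Override
        protected void setup(Context ctx) {
            // Centroids passed as "x1,y1;x2,y2;..." (assumed convention).
            String[] rows = ctx.getConfiguration().get("kmeans.centroids").split(";");
            centroids = new double[rows.length][];
            for (int i = 0; i < rows.length; i++) {
                String[] p = rows[i].split(",");
                centroids[i] = new double[]{Double.parseDouble(p[0]), Double.parseDouble(p[1])};
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] p = value.toString().split(",");
            double x = Double.parseDouble(p[0]), y = Double.parseDouble(p[1]);
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < centroids.length; i++) {
                double dx = x - centroids[i][0], dy = y - centroids[i][1];
                double d = dx * dx + dy * dy; // squared Euclidean distance
                if (d < bestDist) { bestDist = d; best = i; }
            }
            ctx.write(new IntWritable(best), value);
        }
    }

    // Reducer: recompute each centroid as the mean of its assigned points.
    public static class RecomputeReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable cluster, Iterable<Text> points, Context ctx)
                throws IOException, InterruptedException {
            double sx = 0, sy = 0;
            long n = 0;
            for (Text t : points) {
                String[] p = t.toString().split(",");
                sx += Double.parseDouble(p[0]);
                sy += Double.parseDouble(p[1]);
                n++;
            }
            ctx.write(cluster, new Text((sx / n) + "," + (sy / n)));
        }
    }
}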


2019
Vol 16 (9)
pp. 3824-3829
Author(s):  
Deepak Ahlawat
Deepali Gupta

Due to advancements in the technological world, there is a great surge in data. The main sources generating such large amounts of data are social websites, Internet sites, etc. The large data files are combined together to create a big data architecture. Managing data files at such volume is not easy; therefore, modern techniques have been developed to manage bulk data. To arrange and utilize such big data, the Hadoop Distributed File System (HDFS) architecture from Hadoop was presented in early 2015. This architecture is used when traditional methods are insufficient to manage the data. In this paper, a novel clustering algorithm is implemented to manage a large amount of data. The concepts and frames of Big Data are studied, and a novel algorithm is developed in this paper using K-means and cosine-based similarity clustering. The developed clustering algorithm is evaluated using precision and recall, and prominent results are obtained that successfully address the big data issue.
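The abstract pairs K-means with cosine-based similarity; the sketch below shows what that distance choice looks like in the assignment step. It is a generic illustration with invented names, not the paper's implementation: with cosine similarity the algorithm picks the most similar centroid (values near 1) rather than the smallest Euclidean distance:

public final class CosineSimilarity {

    // Cosine similarity between two equal-length vectors: (a.b) / (|a||b|).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // K-means assignment with cosine similarity: pick the MOST similar
    // centroid (higher is closer), not the smallest Euclidean distance.
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestSim = -1.0; // cosine similarity ranges over [-1, 1]
        for (int i = 0; i < centroids.length; i++) {
            double sim = cosine(point, centroids[i]);
            if (sim > bestSim) { bestSim = sim; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{1, 0}, {0, 1}};
        // The point (0.9, 0.2) is directionally closest to centroid 0.
        System.out.println(nearestCentroid(new double[]{0.9, 0.2}, centroids));
    }
}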


Big data security is among the most studied research issues nowadays due to the increasing size of data and the complexity involved in handling large volumes of it. Ensuring security in big data handling is especially difficult because of its 4V characteristics. With the aim of ensuring security and flexible encryption computation on big data with reduced computation overhead, this work presents an encryption framework (MRS) built on the Hadoop Distributed File System (HDFS). Development of the MapReduce paradigm needs network-attached storage in addition to parallel processing. For storing as well as handling big data, HDFS is extensively utilized. The proposed method creates a framework that obtains data from the client, examines the received data, extracts the privacy policy, and then identifies the sensitive data. Security is guaranteed in this framework using a key rotation algorithm, an efficient encryption and decryption technique for safeguarding data at big data scale. Data encryption is a means to protect data in storage, with the encryption key saved and accessible so the data can be reused when required. The outcomes show that the method guarantees greater security for enormous amounts of data and gives beneficial information to the related clients; the proposed method is therefore superior to the previous one. Finally, this research can be applied effectively in various domains, such as health care, education, and social networking, which require more security and involve increased volumes of data.
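The abstract names a key rotation algorithm but does not specify it; the sketch below is a generic illustration of the rotation idea only, with AES standing in and all names invented, not the MRS framework's actual algorithm. Stored ciphertext is decrypted under the retiring key and re-encrypted under its successor, so any single key has a bounded exposure window:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class KeyRotationDemo {

    static SecretKey freshKey() throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(256);
        return gen.generateKey();
    }

    // "AES" defaults to ECB with PKCS5 padding; a production system would
    // use an authenticated mode such as AES/GCM with random IVs instead.
    static byte[] encrypt(SecretKey key, byte[] plain) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.ENCRYPT_MODE, key);
        return c.doFinal(plain);
    }

    static byte[] decrypt(SecretKey key, byte[] cipherText) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.DECRYPT_MODE, key);
        return c.doFinal(cipherText);
    }

    // Rotation: decrypt with the retiring key, re-encrypt with its successor.
    static byte[] rotate(SecretKey oldKey, SecretKey newKey, byte[] cipherText) throws Exception {
        return encrypt(newKey, decrypt(oldKey, cipherText));
    }

    public static void main(String[] args) throws Exception {
        SecretKey k1 = freshKey(), k2 = freshKey();
        byte[] stored = encrypt(k1, "sensitive record".getBytes());
        byte[] rotated = rotate(k1, k2, stored);              // periodic rotation
        System.out.println(new String(decrypt(k2, rotated))); // "sensitive record"
    }
}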


2019
Vol 8 (4)
pp. 10051-10056

In recent years, the term big data has referred to huge amounts of data that can be analyzed to uncover hidden attributes. Today's technologies make it possible to analyze the data and obtain results almost immediately. Why is big data so important? Because of cost reduction and faster, better decision making using Hadoop. For example, a large warehouse of terabytes of data is generated daily from social media such as Twitter, LinkedIn, and Facebook, which are examples of organizations in the people-to-people communication area of big data. Big data has three most important challenges: Volume, Variety, and Velocity. In this paper we study the performance of the Traditional Distributed File System (TDFS) and the Hadoop Distributed File System (HDFS). A benefit of HDFS over TDFS is its support for the Flume tool in Hadoop. Memory block size, data retrieval time, and security are used as metrics in evaluating the performance of TDFS and HDFS. Results show that HDFS performs better than TDFS on the above metrics, and that HDFS is more suitable than TDFS for big data analysis.


2019
Author(s):  
Rhauani W. Fazul
Patricia Pitthan Barcelos

Data replication is a fundamental mechanism of the Hadoop Distributed File System (HDFS). However, the way data is spread across the cluster directly affects replication balancing. The HDFS Balancer is a Hadoop-integrated tool that can balance the storage load on each machine by moving data between nodes, although its operation does not address the specific needs of applications while performing block rearrangement. This paper proposes a customized balancing policy for the HDFS Balancer based on a system of priorities, which can be adapted and configured according to usage demands. The priorities define whether HDFS parameters or cluster topology should be considered during the operation, thus making the balancing more flexible.
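The paper's policy itself is not reproduced in the abstract; purely as a sketch of what a configurable, priority-driven choice between utilization parameters and cluster topology could look like, with every name invented (none of this is the paper's or the HDFS Balancer's actual code):

import java.util.Comparator;
import java.util.List;

public class PriorityBalancingSketch {

    // A node's view relevant to balancing decisions (illustrative fields).
    record DataNodeInfo(String host, String rack, double utilization) {}

    // Configurable priorities: weight utilization imbalance against
    // keeping transfers within a rack (topology awareness).
    static Comparator<DataNodeInfo> policy(double utilWeight, double rackWeight,
                                           String sourceRack) {
        return Comparator.comparingDouble((DataNodeInfo n) ->
                utilWeight * n.utilization()
              + rackWeight * (n.rack().equals(sourceRack) ? 0.0 : 1.0));
    }

    // Pick the target for a block move: the lowest score under the
    // configured priorities wins.
    static DataNodeInfo pickTarget(List<DataNodeInfo> candidates,
                                   double utilWeight, double rackWeight,
                                   String sourceRack) {
        return candidates.stream()
                .min(policy(utilWeight, rackWeight, sourceRack))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<DataNodeInfo> nodes = List.of(
                new DataNodeInfo("dn1", "rackA", 0.85),
                new DataNodeInfo("dn2", "rackA", 0.40),
                new DataNodeInfo("dn3", "rackB", 0.35));
        // With topology prioritized, dn2 wins despite dn3's lower utilization.
        System.out.println(pickTarget(nodes, 1.0, 0.5, "rackA").host()); // dn2
    }
}

Shifting the weights changes the outcome: with rackWeight set to 0 the policy degenerates to pure utilization balancing, which is the kind of configurability the abstract attributes to its priority system.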


2020
Vol 6
pp. e259
Author(s):  
Gayatri Kapil
Alka Agrawal
Abdulaziz Attaallah
Abdullah Algarni
Rajeev Kumar
...  

Hadoop has become a promising platform to reliably process and store big data. It provides flexible and low-cost services for huge data through Hadoop Distributed File System (HDFS) storage. Unfortunately, the absence of any inherent security mechanism in Hadoop increases the possibility of malicious attacks on the data processed or stored through Hadoop. In this scenario, securing the data stored in HDFS becomes a challenging task. Hence, researchers and practitioners have intensified their efforts to develop mechanisms that would protect users' information collated in HDFS. This has led to the development of numerous encryption-decryption algorithms, but their performance decreases as file size increases. In the present study, the authors propose a methodology to solve the issue of data security in Hadoop storage. The authors have integrated Attribute Based Encryption with honey encryption on Hadoop, i.e., Attribute Based Honey Encryption (ABHE). This approach works on files that are encoded inside the HDFS and decoded inside the Mapper. In addition, the authors have evaluated the proposed ABHE algorithm by performing encryption-decryption on files of different sizes and have compared it with existing algorithms, including AES and AES with OTP. The ABHE algorithm shows considerable improvement in performance during the encryption-decryption of files.
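ABHE is not a public library, so the sketch below illustrates only the data path the abstract describes, files encrypted before landing in HDFS and decrypted inside the Mapper, with AES as a stand-in cipher; the key handling and record format are assumptions, not the authors' scheme:

import java.io.IOException;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DecryptingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Cipher cipher;

    @Override
    protected void setup(Context ctx) throws IOException {
        try {
            // Key material fetched from the job configuration (assumed
            // convention); a real deployment would use Hadoop's credential
            // provider rather than a plain configuration entry.
            byte[] key = Base64.getDecoder().decode(ctx.getConfiguration().get("job.aes.key"));
            cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"));
        } catch (GeneralSecurityException e) {
            throw new IOException("cipher init failed", e);
        }
    }

    @Override
    protected void map(LongWritable offset, Text encryptedLine, Context ctx)
            throws IOException, InterruptedException {
        try {
            // Each input line is assumed to be one Base64-encoded ciphertext record.
            byte[] plain = cipher.doFinal(Base64.getDecoder().decode(encryptedLine.toString()));
            ctx.write(new Text("record"), new Text(new String(plain)));
        } catch (GeneralSecurityException e) {
            throw new IOException("decryption failed", e);
        }
    }
}

The point of decoding inside the Mapper, as the abstract describes, is that plaintext exists only transiently in task memory; the blocks at rest in HDFS stay encrypted throughout.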

