Big Data Security Challenges and Solution of Distributed Computing in Hadoop Environment: A Security Framework

2020 ◽  
Vol 13 (4) ◽  
pp. 790-797
Author(s):  
Gurjit Singh Bhathal ◽  
Amardeep Singh Dhiman

Background: In the current internet scenario, large amounts of data are generated and processed. The Hadoop framework is widely used to store and process big data in a highly distributed manner, yet it is argued that Hadoop is not mature enough to deal with current cyberattacks on the data. Objective: The main objective of the proposed work is to provide a complete security approach comprising authorisation and authentication for users and Hadoop cluster nodes, and to secure the data both at rest and in transit. Methods: The proposed algorithm uses the Kerberos network authentication protocol to authenticate and authorise users and cluster nodes. Ciphertext-Policy Attribute-Based Encryption (CP-ABE) is used for data at rest and data in transit: a user encrypts a file under their own set of attributes and stores it on the Hadoop Distributed File System, and only intended users with matching attributes can decrypt that file. Results: The proposed algorithm was implemented with data sets of different sizes, processed with and without encryption. The results show little difference in processing time: performance was affected in the range of 0.8% to 3.1%, a figure that also includes the impact of other factors such as system configuration, the number of parallel jobs running, and the virtual environment. Conclusion: The solutions available for the big data security problems faced in the Hadoop framework are inefficient or incomplete. A complete security framework is proposed for the Hadoop environment, experimentally shown to have little effect on system performance for datasets of different sizes.
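The attribute-matching idea behind the CP-ABE step can be illustrated with a toy sketch: a symmetric key is derived from the encrypting user's attribute set, so only a reader presenting the same attributes recovers the plaintext. This is an illustration only, not real CP-ABE (which uses bilinear pairings and policy trees, and supports policies richer than exact-set matching); all function names here are hypothetical.

```python
import hashlib
from itertools import cycle

def attribute_key(attributes):
    """Derive a symmetric key from a canonicalised attribute set
    (toy stand-in for a CP-ABE key; order of attributes must not matter)."""
    canon = ",".join(sorted(attributes)).encode()
    return hashlib.sha256(canon).digest()

def encrypt(data: bytes, attributes) -> bytes:
    """XOR-"encrypt" data under the attribute-derived key (illustration only)."""
    key = attribute_key(attributes)
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

def decrypt(blob: bytes, attributes) -> bytes:
    # XOR is symmetric, so decryption is the same operation with the same key.
    return encrypt(blob, attributes)

blob = encrypt(b"payroll.csv contents", {"role:finance", "dept:hr"})
# A matching attribute set recovers the plaintext; any other set yields garbage.
plain = decrypt(blob, {"dept:hr", "role:finance"})
```

In the abstract's scheme the ciphertext, not a shared key, lives on HDFS, so the cluster never needs to hold decryption material for data at rest.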

Big Data consists of large volumes of data sets in various formats, i.e., structured, unstructured and semi-structured. Big Data requires security because attackers target it daily in different ways. Big Data Security Analytics analyses Big Data to find various threats and complex attacks. With targeted attacks on data increasing on one side and data growing rapidly on the other, accurate analysis is very difficult. Security Analytics Systems also operate on untrusted data, so strong security analytical tools are required to analyse it. Organizations and industries exchange data dynamically over networks, which makes the data more vulnerable to misuse and theft. Attackers have become so advanced that existing security mechanisms often fail to identify attacks before damage is done. At present, collecting and analysing the various attacks in order to take suitable decisions is a major challenge for Security Analytics Systems. In this research paper, we address how the Hadoop tool analyses Big Data and how Big Data Security Analytics is applied to analyse the various threats and secure business data more accurately.


2016 ◽  
pp. 1220-1243
Author(s):  
Ilias K. Savvas ◽  
Georgia N. Sofianidou ◽  
M-Tahar Kechadi

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets: HDFS is a distributed file system that provides high-throughput access for data-driven applications, and MapReduce is a software framework for distributed computing over large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique prove its efficiency; thus, HDFS and MapReduce can be applied to big data with very promising results.
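The distributed K-means idea can be sketched in plain Python: the map step assigns each point to its nearest centroid, and the reduce step averages the points of each cluster to produce new centroids. The function names and the deterministic initialisation are illustrative choices, not the authors' implementation.

```python
def mapper(point, centroids):
    """Map step: emit (cluster_id, point) for the nearest centroid."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists)), point

def reducer(cluster_points):
    """Reduce step: average the points assigned to one centroid."""
    n, dim = len(cluster_points), len(cluster_points[0])
    return tuple(sum(p[d] for p in cluster_points) / n for d in range(dim))

def kmeans(points, k, iterations=10):
    # Deterministic initialisation for reproducibility; real runs often
    # use random or k-means++ seeding, and reseed clusters that go empty.
    centroids = points[:k]
    for _ in range(iterations):
        groups = {}
        for pt in points:
            cid, _ = mapper(pt, centroids)
            groups.setdefault(cid, []).append(pt)
        centroids = [reducer(pts) for pts in groups.values()]
    return centroids
```

On a real cluster the `groups` shuffle is performed by Hadoop itself, and each MapReduce iteration is one job; only the small centroid list travels between iterations.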


2018 ◽  
Vol 7 (2.26) ◽  
pp. 80
Author(s):  
Dr E. Laxmi Lydia ◽  
M Srinivasa Rao

Big Data is among the most prominent subjects in cloud research; its main characteristics are volume, velocity and variety, which make it difficult to manage with traditional software and the various available methodologies. Data arriving from the various domains of big data is handled through Hadoop, an open-source framework developed to provide solutions for such workloads. Big data analytics in Hadoop is performed by the MapReduce framework, the key engine of a Hadoop cluster and still extensively used today; it is a batch-processing system. Apache developed an engine named Tez, which supports interactive query workloads and does not write temporary data to the Hadoop Distributed File System (HDFS). This paper focuses on a performance comparison of MapReduce and Tez; the two engines are examined under compression of input files and map output files, using the Bzip2 compression algorithm for the input files and Snappy for the map output files. The WordCount and Terasort benchmarks are used in our experiments. For the WordCount benchmark, the results show that the Tez engine has better execution time than the Hadoop MapReduce engine for both compressed and non-compressed data, reducing execution time by nearly 39% compared to the Hadoop MapReduce engine. For the Terasort benchmark, in contrast, the Tez engine showed higher execution time than the Hadoop MapReduce engine.
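As a sketch, the map-output compression described above is typically enabled through `mapred-site.xml` properties such as the following (property names as in the Hadoop 2.x configuration; Bzip2-compressed input files are normally detected automatically by their `.bz2` extension and, unlike most codecs, remain splittable):

```xml
<!-- mapred-site.xml: compress intermediate map output with Snappy -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Snappy is a common choice for intermediate data because it trades compression ratio for very fast compression and decompression during the shuffle.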


Big data refers to huge stores of information, and the task is to dig into them, extract the important information, and build useful systems that can be very helpful in improving the current scenario. Big data is used in various applications, and several fields are learning techniques to work with it, evaluate their work and reach improved decisions. This paper concentrates on the e-commerce system, which is highly trending in the market. [20] E-commerce, also known as electronic commerce, is a marketplace that provides a platform with services for both buyers and sellers: the consumer gets a wide variety of products to choose from, while the seller gets a platform to showcase products to millions of customers at the same time without having to look after the site himself, since the system takes care of it. Big data plays a vital role in e-commerce because it learns user behaviour and suggests products a user is likely to need based on that behaviour and their queries; various machine learning algorithms work on this and improve the services. [11] In this paper we read the user information, combine it with product attributes, and produce suggestions the user is most likely to purchase. The existing system looks at only one side of the case when giving suggestions, whereas here we look at both sides: the product entities (the attributes and features a product possesses) and the user behaviour (the information given by the user and their previous history), which gives better predictions and improves the system.
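The two-sided idea above, matching product attributes against a profile built from the user's behaviour and history, can be sketched minimally. The Jaccard-overlap score and all names here are illustrative assumptions, not the paper's actual algorithm:

```python
def score(user_profile_attrs, product_attrs):
    """Toy relevance score: Jaccard overlap between attributes drawn from the
    user's history/queries and a candidate product's attribute set."""
    u, p = set(user_profile_attrs), set(product_attrs)
    return len(u & p) / len(u | p) if u | p else 0.0

def recommend(user_profile_attrs, catalog, top_n=3):
    """Rank catalog items (name -> attribute set) by overlap with the profile."""
    ranked = sorted(catalog.items(),
                    key=lambda kv: score(user_profile_attrs, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]
```

A production recommender would learn weights for attributes rather than treating them uniformly, but the structure, one signal from the product side and one from the user side, is the same.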
Moreover, for optimized operation of the system, we include an enhanced version of the HPCA scheduling algorithm for the Hadoop Distributed File System (HDFS), which is well suited to heterogeneous systems. The existing algorithm considers only the overall capacity of a node before assigning tasks; here we also consider the health and the leftover capacity of each node, arranging the queue accordingly and refreshing it every time a node completes a task. [18] The aim of the paper is to provide fast and well-suited suggestions to users, which can play a vital role in improving the company's sales and meeting targets sooner.
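The node-selection idea, ordering nodes by health and then by leftover capacity, and refreshing the ordering after each assignment, can be sketched as follows; the class and its ordering rule are hypothetical, for illustration only:

```python
class NodeQueue:
    """Cluster nodes ordered by (health, leftover capacity); the ordering is
    re-evaluated after every task assignment -- a sketch of the enhanced-HPCA idea."""

    def __init__(self, nodes):
        # nodes: dict of name -> (health score in [0, 1], free capacity units)
        self.nodes = dict(nodes)

    def pick(self):
        # Tuple comparison: healthiest node first, most free capacity breaks ties.
        return max(self.nodes, key=lambda n: self.nodes[n])

    def assign(self, cost):
        """Assign a task of the given cost to the best node and refresh its entry."""
        node = self.pick()
        health, free = self.nodes[node]
        self.nodes[node] = (health, free - cost)  # requeue with updated capacity
        return node
```

Because `pick` consults the live table each time, a node that finishes a task (regaining capacity) or degrades (losing health) immediately changes its position, which is the "refreshed all the time" behaviour the paper describes.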


Author(s):  
Ashwini T ◽  
Sahana LM ◽  
Mahalakshmi E ◽  
Shweta S Padti

Analysis of consistent and structured data has seen huge success in past decades, whereas analysis of unstructured data in multimedia formats remains a challenging task. YouTube is one of the most popular and widely used social media tools. The main aim of this paper is to analyse the data generated from YouTube, which can be mined and utilized: it is fetched through the YouTube API (Application Programming Interface) and stored in the Hadoop Distributed File System (HDFS). The dataset can then be analysed using MapReduce to identify the video categories in which the most videos are uploaded. The objective of this paper is to demonstrate the Hadoop framework and its many components for processing and handling big data. In the existing method, big data is analysed and processed in multiple stages using MapReduce; because each job consumes a large amount of space, implementing iterative MapReduce jobs is expensive. To overcome these drawbacks, a Hive-based method, the state-of-the-art approach, is used to analyse the big data: Hive extracts the YouTube information via a generated API key and uses SQL-like queries.
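The category analysis can be sketched as a MapReduce-style count in Python; in Hive the same result comes from a single `GROUP BY` query. The record layout and function names are assumptions for illustration:

```python
from collections import Counter

def map_phase(records):
    """Map step: emit (category, 1) for each uploaded-video record."""
    for rec in records:
        yield rec["category"], 1

def reduce_phase(pairs):
    """Reduce step: sum counts per category. Equivalent Hive query:
    SELECT category, COUNT(*) FROM videos GROUP BY category;"""
    totals = Counter()
    for category, n in pairs:
        totals[category] += n
    return dict(totals)
```

The Hive version avoids chaining iterative MapReduce jobs by hand, which is the space and cost advantage the abstract points to.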

