Applying compression algorithms on a Hadoop cluster implemented through Apache Tez and Hadoop MapReduce

2018 ◽  
Vol 7 (2.26) ◽  
pp. 80
Author(s):  
Dr E. Laxmi Lydia ◽  
M Srinivasa Rao

Big Data is currently the most prominent topic across the cloud research area; its main characteristics are volume, velocity and variety. These characteristics are difficult to manage with traditional software and the methodologies available to it. Data arriving from the various domains of big data is handled through Hadoop, an open-source framework developed mainly to provide such solutions. Big data analytics is carried out through the Hadoop MapReduce framework, which is the key engine of a Hadoop cluster, is widely used today, and relies on batch processing. Apache developed an engine named "Tez", which supports interactive queries and does not write temporary data to the Hadoop Distributed File System (HDFS). This paper focuses on a performance comparison of MapReduce and Tez; the performance of the two engines is examined by compressing the input files and the map output files. To compare the two engines, we used the Bzip2 compression algorithm for the input files and Snappy for the map output files. The WordCount and TeraSort benchmarks are used in our experiments. For the WordCount benchmark, the results show that the Tez engine has a better execution time than the Hadoop MapReduce engine for both compressed and uncompressed data, reducing execution time by nearly 39% compared with the Hadoop MapReduce engine. For the TeraSort benchmark, however, the Tez engine has a higher execution time than the Hadoop MapReduce engine.
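To make the compression setup concrete, here is a minimal sketch (not the authors' code) of how a MapReduce driver typically enables Snappy for the intermediate map output while relying on Hadoop's automatic codec detection for Bzip2-compressed input. The class name and paths are illustrative; only standard Hadoop property names and APIs are assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedWordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress the intermediate map output with Snappy, as in the experiments above.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "wordcount-compressed");
        job.setJarByClass(CompressedWordCountDriver.class);
        // Mapper, reducer and key/value classes omitted; any WordCount implementation fits here.

        // Bzip2-compressed input (*.bz2) is recognised by its file extension,
        // so no input-side codec setting is required.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```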

2020 ◽  
Vol 34 (28) ◽  
pp. 2050311
Author(s):  
Satvik Vats ◽  
B. B. Sagar

In the Big data domain, platform dependency can alter the behavior of the business because of the different kinds (structured, semi-structured and unstructured) and characteristics of the data. With traditional infrastructure, different kinds of data cannot be processed simultaneously, since each is tied to a particular platform for a particular task, so the responsibility of selecting suitable tools lies with the user. The variety of data generated by different sources requires the selection of suitable tools without human intervention. Further, these tools also face resource limitations when dealing with a large volume of data, and this limitation affects their performance in terms of execution time. Therefore, in this work we propose a model in which different data analytics tools share a common infrastructure to provide a data-independent, resource-sharing environment: the proposed model shares a common (hybrid) Hadoop Distributed File System (HDFS) between three Name-Nodes (master nodes), three Data-Nodes and one client node, operating within a DeMilitarized Zone (DMZ). To realize this model, we implemented Mahout, R-Hadoop and Splunk sharing a common HDFS. Using our model, we then ran k-means clustering, Naïve Bayes and recommender algorithms on three different datasets (movie ratings, newsgroups and spam SMS), representing structured, semi-structured and unstructured data, respectively. Our model selected the appropriate tool, e.g. Mahout for the newsgroup dataset, which the other tools cannot process; this shows that our model provides data independence. The results of the proposed model are further compared with the legacy (individual) model in terms of execution time and scalability. The improved performance of the proposed model supports the hypothesis that it overcomes the resource limitations of the legacy model.
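As an illustration of the shared-HDFS idea only (the paper's actual selection logic, DMZ setup and node layout are not reproduced here), the following hedged Java sketch connects to a single HDFS namespace and dispatches a dataset to one of the three tools. The URI, class name and the data-kind-to-tool mapping are assumptions, except that the abstract itself states Mahout handles the newsgroup data.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ToolDispatcher {
    // Hypothetical labels for the three data kinds handled by the shared cluster.
    enum DataKind { STRUCTURED, SEMI_STRUCTURED, UNSTRUCTURED }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // All tools read from the same (shared) HDFS namespace; the URI is illustrative.
        FileSystem sharedHdfs = FileSystem.get(URI.create("hdfs://master-node:8020"), conf);

        Path dataset = new Path(args[0]);
        if (!sharedHdfs.exists(dataset)) {
            throw new IllegalArgumentException("Dataset not found on shared HDFS: " + dataset);
        }

        // In the paper the model itself selects the tool; the kind is passed in here
        // as a stand-in for that selection logic.
        DataKind kind = DataKind.valueOf(args[1].toUpperCase());
        switch (kind) {
            case STRUCTURED:      System.out.println("Dispatch to R-Hadoop (illustrative mapping)"); break;
            case SEMI_STRUCTURED: System.out.println("Dispatch to Mahout (as for the newsgroup data)"); break;
            case UNSTRUCTURED:    System.out.println("Dispatch to Splunk (illustrative mapping)"); break;
        }
    }
}
```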


2020 ◽  
Vol 13 (4) ◽  
pp. 790-797
Author(s):  
Gurjit Singh Bhathal ◽  
Amardeep Singh Dhiman

Background: In the current internet scenario, large amounts of data are generated and processed. The Hadoop framework is widely used to store and process big data in a highly distributed manner, yet it is argued that the framework is not mature enough to deal with current cyberattacks on the data. Objective: The main objective of the proposed work is to provide a complete security approach comprising authorisation and authentication for the users and the Hadoop cluster nodes, and to secure the data at rest as well as in transit. Methods: The proposed algorithm uses the Kerberos network authentication protocol for authorisation and authentication and to validate the users and the cluster nodes. Ciphertext-Policy Attribute-Based Encryption (CP-ABE) is used for data at rest and data in transit: users encrypt a file with their own set of attributes and store it on the Hadoop Distributed File System, and only intended users with matching attributes can decrypt that file. Results: The proposed algorithm was implemented with datasets of different sizes, processed both with and without encryption. The results show little difference in processing time: performance was affected by between 0.8% and 3.1%, a range that also includes the impact of other factors such as system configuration, the number of parallel jobs running and the virtual environment. Conclusion: The solutions available for handling the big data security problems faced in the Hadoop framework are inefficient or incomplete. A complete security framework is proposed for the Hadoop environment, and the solution is experimentally shown to have little effect on system performance for datasets of different sizes.
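A minimal sketch of the Kerberos side of such a client, using only standard Hadoop security APIs. The principal, keytab path and file names are placeholders, and the CP-ABE encryption is assumed to have been applied to the file before this upload step (no specific CP-ABE library is implied by the abstract, so none is shown).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Enable Kerberos authentication for the Hadoop client (standard Hadoop settings).
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hadoop.security.authorization", "true");

        UserGroupInformation.setConfiguration(conf);
        // Principal and keytab path are illustrative placeholders.
        UserGroupInformation.loginUserFromKeytab("analyst@EXAMPLE.COM",
                "/etc/security/keytabs/analyst.keytab");

        // Once authenticated, the client can store the (already CP-ABE-encrypted) file on HDFS.
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("record.enc"), new Path("/secure/record.enc"));
        fs.close();
    }
}
```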


Author(s):  
Smys S

In most research areas, failures have shown that a lack of actionable and valuable detail in the data behind proposed solutions lies at the core of the crisis. This is especially true in the healthcare industry, where even the early diagnosis of a chronic disease could not save a person's life, because individual outcomes could not be predicted across the entire population. Evolving technologies have changed this scenario by leveraging mobile devices and internet services such as sensor networks and smart monitors, enhancing practical healthcare through predictive modeling that acquires deeper individual measures. This allows researchers to work through huge datasets, identify patterns and trends, and deliver solutions that improve medical care, minimize cost, regulate access to healthcare and ensure the safety of human lives. The paper provides a survey of predictive big data analytics and the accuracy it provides in the healthcare system.


2021 ◽  
Vol 22 (3) ◽  
pp. 303-312
Author(s):  
Jitali Patel ◽  
Ruhi Patel ◽  
Saumya Shah ◽  
Jigna Ashish Patel

Big data analytics involves a systematic approach to finding hidden patterns in large volumes and varieties of data to help organizations grow. In recent years big data analytics has been widely used in the agricultural domain to improve yield. Viticulture (the cultivation of grapes) is one of the most lucrative forms of farming in India; it is a subdivision of horticulture and the study of wine growing. The demand for Indian wine has been increasing at about 27% each year since the beginning of the 21st century, and so more and more ways are being developed to improve the quality and quantity of wine products. In this paper, we focus on a specific agricultural practice, viticulture. Weather forecasting and disease detection are the two main research areas in precision viticulture; leaf disease detection, as a part of plant pathology, is the key research area of this paper. It can be applied to vineyards in India whose farmers are bereft of the latest technologies. The proposed system architecture comprises four modules: data collection, data preprocessing, classification and visualization. The database module involves a grape leaf dataset consisting of healthy images together with diseased leaves showing black measles, black rot and leaf blight. The models have been implemented on Apache Hadoop using the MapReduce programming framework. The system applies feature extraction to extract various features of the live images and a classification algorithm with reduced computational complexity: a Gray Level Co-occurrence Matrix (GLCM) followed by the K-Nearest Neighbor (KNN) algorithm. Based on the classification results, the system also recommends the steps and remedies that viticulturists can take to ensure the grapes are salvaged at the right time and in the right manner. Overall, the system will help Indian viticulturists improve the harvesting process. The accuracy of the model is 72%, and as future work it can be increased by including deep learning with time-series grape leaf images.
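As a plain, library-free illustration of the GLCM-followed-by-KNN combination named above (the paper's own version runs inside Hadoop MapReduce and its exact feature set is not specified here), a compact Java sketch might look like the following. Class and method names are illustrative, and images are assumed to arrive as arrays already quantized to a fixed number of grey levels.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Minimal GLCM + k-NN sketch; image loading, resizing and the MapReduce wrapper are omitted. */
public class GlcmKnnSketch {

    /** Normalised grey-level co-occurrence matrix for a horizontal (dx = 1) offset.
     *  Pixel values are assumed to lie in [0, levels). */
    static double[][] glcm(int[][] grey, int levels) {
        double[][] m = new double[levels][levels];
        double pairs = 0;
        for (int y = 0; y < grey.length; y++) {
            for (int x = 0; x + 1 < grey[y].length; x++) {
                m[grey[y][x]][grey[y][x + 1]]++;
                pairs++;
            }
        }
        for (double[] row : m)
            for (int j = 0; j < levels; j++) row[j] /= pairs;
        return m;
    }

    /** Three classic texture features derived from the GLCM: contrast, energy, homogeneity. */
    static double[] features(double[][] m) {
        double contrast = 0, energy = 0, homogeneity = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m.length; j++) {
                contrast    += (i - j) * (i - j) * m[i][j];
                energy      += m[i][j] * m[i][j];
                homogeneity += m[i][j] / (1.0 + Math.abs(i - j));
            }
        }
        return new double[] { contrast, energy, homogeneity };
    }

    /** Plain k-NN with Euclidean distance and majority vote over integer class labels. */
    static int knn(double[][] train, int[] labels, double[] query, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(train[i], query)));
        int[] votes = new int[Arrays.stream(labels).max().orElse(0) + 1];
        for (int i = 0; i < k && i < idx.length; i++) votes[labels[idx[i]]]++;
        int best = 0;
        for (int c = 1; c < votes.length; c++) if (votes[c] > votes[best]) best = c;
        return best;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```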


Author(s):  
Vinay Kellengere Shankarnarayan

In recent years, big data has gained massive popularity among researchers, decision analysts, and data architects in every enterprise. Big data had once been just another way of saying analytics, but in today's world a company's capital lies in its data: think of the world's largest companies, whose value comes from the data they analyze for proactive benefit. This chapter showcases insights into big data and the tools and techniques companies have adopted to deal with data problems. The authors also focus on frameworks and methodologies for handling massive data in order to make more accurate and precise decisions. The chapter begins with the current organizational scenario and what is meant by big data, then draws out the various challenges faced by organizations. The authors also examine big data business models and the different frameworks available and how they have been categorized, and the conclusion discusses the challenges and the future perspective of this research area.


Information ◽  
2019 ◽  
Vol 11 (1) ◽  
pp. 17 ◽  
Author(s):  
Laden Husamaldin ◽  
Nagham Saeed

Big data analytics (BDA) is an increasingly popular research area for both organisations and academia due to its usefulness in facilitating human understanding and communication. In the literature, researchers have focused on classifying big data according to data type, data security or level of difficulty, and many research papers reveal a lack of evidence for a real-world link between big data analytics methods and their associated techniques. Thus, many organisations are still struggling to realise the actual value of big data analytic methods and their associated techniques. This paper therefore gives a design-research account of formulating and proposing a step towards understanding the relation between analytical methods and their associated techniques. Furthermore, it attempts to clarify this uncertainty and identify the difference between analytics methods and techniques by giving clear definitions for each method and its associated techniques, so that they can later be integrated into a new correlation taxonomy based on the research approaches. The primary outcome of this research is thus to achieve, for the first time, a correlation taxonomy combining the analytic methods used for big data with their recommended techniques, suitable for various sectors. This investigation was carried out by studying descriptive articles on big data analytics methods and their associated techniques in different industries.


Author(s):  
Viju Raghupathi ◽  
Yilu Zhou ◽  
Wullianallur Raghupathi

In this article, the authors explore the potential of a big data analytics approach to unstructured text analytics of cancer blogs. The application is developed using the Cloudera platform's Hadoop MapReduce framework. It uses several text analytics algorithms, including word count, word association, clustering, and classification, to identify and analyze the patterns and keywords in cancer blog postings. The article establishes an exploratory approach to applying big data analytics methods to the development of text analytics applications for analyzing cancer blogs. Additional insights are extracted through various means, including the development of categories or keywords contained in the blogs, the construction of a taxonomy, and the examination of relationships among the categories. The application has the potential for generalizability and implementation with health content in other blogs and social media, and it can provide insight and decision support for cancer management and facilitate efficient and relevant searches for information related to cancer.
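The word-count step is the simplest of the algorithms listed; a standard Hadoop MapReduce word-count pair of the kind such an application would run over the blog corpus is sketched below. Class names are illustrative and the job driver is omitted.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BlogWordCount {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input value is one line of blog text; emit (term, 1) per token.
            StringTokenizer tokens = new StringTokenizer(value.toString().toLowerCase());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}
```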


2016 ◽  
Vol 15 (8) ◽  
pp. 6991-6998
Author(s):  
Idris Hanafi ◽  
Amal Abdel-Raouf

The increasing amount and size of data being handled by data analytic applications running on Hadoop has created a need for faster data processing. One of the effective methods for handling big data sizes is compression. Data compression not only makes network I/O processing faster, but also provides better utilization of resources. However, this approach defeats one of Hadoop's main purposes, which is the parallelism of map and reduce tasks: the number of map tasks created is determined by the size of the file, so compressing a large file reduces the number of mappers, which in turn decreases parallelism. Consequently, standard Hadoop takes longer to process compressed data. In this paper, we propose the design and implementation of a Parallel Compressed File Decompressor (P-Codec) that improves the performance of Hadoop when processing compressed data. P-Codec includes two modules. The first module decompresses data retrieved by a data node during the phase of uploading the data to the Hadoop Distributed File System (HDFS); this reduces the runtime of a job by removing the burden of decompression from the MapReduce phase. The second P-Codec module is a decompressed map task divider that increases parallelism by dynamically changing the map task split sizes based on the size of the final decompressed block. Our experimental results using five different MapReduce benchmarks show an average improvement of approximately 80% compared to standard Hadoop.
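For context on the mechanism P-Codec adjusts, the sketch below shows the stock Hadoop split-size knobs that determine how many mappers an input file receives. P-Codec itself sets the split sizes dynamically from the decompressed block size; this fixed-cap example only illustrates the underlying knob, and the class name and sizes are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        job.setJarByClass(SplitSizeDemo.class);

        // In stock Hadoop the number of map tasks follows the input splits.
        // Capping the split size at 64 MB forces more, smaller splits and hence more mappers;
        // P-Codec instead derives the split size from the final decompressed block.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Mapper, reducer and output path omitted; only the split-size settings matter here.
    }
}
```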


Big data is one of the fastest growing technologies and can handle huge amounts of data from various sources such as social media, web logs, and the banking and business sectors. To keep pace with changes in data patterns and to accommodate the requirements of big data analytics, storage and processing platforms such as Hadoop also require great advancement. Hadoop, an open-source project, executes big data processing jobs in map and reduce phases and follows a master-slave architecture. A Hadoop MapReduce job can be delayed if one of its many tasks is assigned to an unreliable or congested machine. To solve this straggler problem, a novel design of speculative execution schemes for parallel processing clusters is proposed from an optimization perspective, under different loading conditions. For the lightly loaded case, a task cloning scheme, namely the combined file task cloning algorithm, based on maximizing overall system utility, is proposed together with a straggler detection algorithm based on a workload threshold. Detecting stragglers and cloning only the tasks assigned to them is not enough to enhance performance unless the cloned tasks are allocated in a resource-aware manner, so a method is proposed that identifies and optimizes resource allocation by considering all relevant aspects of cluster performance balancing. One main issue arises from the pre-configuration of distinct map and reduce slots based on the number of files in the input folder, which can cause severe slot under-utilization because map slots might not be fully utilized with respect to the input splits. To solve this issue, an alternative Hadoop slot allocation technique is introduced in this paper that retains efficient management of the slot model. The combined file task cloning algorithm combines files that are smaller than a single data block and executes them on the best-performing data node. Implementing these cloning and combining techniques on a heavily loaded cluster after straggler detection is found to reduce the elapsed execution time to an average of 40%. The detection algorithm improves the overall performance of the heavily loaded cluster by 20% of the total elapsed time in comparison with the native Hadoop algorithm.
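For comparison with the proposed cloning scheme, the baseline straggler mitigation in stock Hadoop is its built-in speculative execution, enabled per job as sketched below. This is not the paper's combined file task cloning algorithm, only the native mechanism it is compared against; the class name is illustrative and only standard Hadoop properties are assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationBaseline {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Stock Hadoop straggler mitigation: speculatively re-launch slow map/reduce attempts.
        // The combined file task cloning algorithm goes further, but this is the native baseline.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "speculation-baseline");
        job.setJarByClass(SpeculationBaseline.class);
        // Remaining job setup (mapper, reducer, input/output paths) omitted.
    }
}
```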


2021 ◽  
Author(s):  
Saravanan A.M. ◽  
K. Loheswaran ◽  
G. Naga Rama Devi ◽  
Karuppathal R ◽  
C Balakrishnan ◽  
...  

Abstract With the growth of humanity and the development of Internet resources, storage requirements grow with each day, and digital records become accessible in the cloud in explorable formats. The immediate future of Big Data is coming shortly for almost all sectors. Big data can aid in the transformation of significant company operations by offering a recommended and reliable overview of the available data, and it has also figured prominently in the detection of violence. Present frameworks for designing Big data implementations are capable of processing vast quantities of data through Big data analytics, using collections of computing devices working together to execute complex processing. Furthermore, existing technologies have not been built to fulfil the specifications of time-critical application areas and are far more oriented towards conventional applications than time-critical ones. This paper proposes a lightweight architecture based on Yet Another Resource Negotiator (YARN), focusing on the concept of a time-critical big-data system from the perspective of its specifications and analysing the essential principles of several common big-data implementations. YARN acts as the common computational framework that supports MapReduce and other application types within a Hadoop cluster; it allows multiple programs to execute concurrently on a shared set of servers and lets programs request resources depending on need. The final evaluation addresses problems stemming from the infrastructure and services that serve applications, recommends a framework, and provides preliminary efficiency behaviours that relate system impacts to implementation reliability.
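A minimal sketch of how any framework negotiates with YARN through the standard client API; the application name is illustrative, and the ApplicationMaster specification, queue and resource requests that a real time-critical job would set are omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        // YARN lets several frameworks (MapReduce, Tez, others) share one cluster;
        // each framework negotiates containers through a client like this.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration(new Configuration()));
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        ApplicationId appId = context.getApplicationId();
        context.setApplicationName("time-critical-job");   // name is illustrative
        // ApplicationMaster container spec, queue and resource requests would be set here.

        System.out.println("Prepared YARN application " + appId);
        yarnClient.stop();
    }
}
```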

