Advances in Data Mining and Database Management - Big Data Processing With Hadoop

Published by IGI Global
ISBN: 9781522537908, 9781522537915
Total documents: 10

The previous chapter gave an overview of big data, including its types, sources, analytic techniques, and applications. This chapter discusses the architecture components that deal with huge volumes of data. The complexity of big data types calls for a logical architecture with layers and high-level components, relating data sources to atomic patterns in a big data solution. The dimensions of the approach are volume, variety, velocity, veracity, and governance. The layers of the architecture are the big data sources, the data massaging and store layer, the analysis layer, and the consumption layer. Big data sources are the data collected from various sources on which data scientists perform analytics; data can come from internal and external sources. Internal sources comprise transactional data, device sensors, business documents, internal files, etc. External sources include social network profiles, geographical data, data stores, etc. Data massaging is the preprocessing that brings the data into a useful format for storage, for example removal of missing values, dimensionality reduction, and noise removal. The analysis layer provides insight using the preferred analytics techniques and tools; the analytics methods, the issues to be considered, the requirements, and the tools are discussed. The consumption layer delivers the resulting business insights to consumers such as retail marketing, the public sector, financial bodies, and the media. Finally, a case study of architectural drivers is applied to a retail industry application, and its challenges and use cases are discussed.
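
To make the data massaging step concrete, the following sketch drops records with missing values before they are stored for analysis. It is illustrative only; the record layout and field names are hypothetical and are not taken from the chapter.

    // Illustrative data-massaging step: keep only complete records.
    import java.util.ArrayList;
    import java.util.List;

    public class DataMassagingSketch {

        public static void main(String[] args) {
            // Raw records from an internal source (e.g. transactional data);
            // null or empty strings stand for missing values.
            List<String[]> rawRecords = new ArrayList<>();
            rawRecords.add(new String[] {"1001", "store-7", "249.99"});
            rawRecords.add(new String[] {"1002", null, "17.50"});   // missing store id
            rawRecords.add(new String[] {"1003", "store-2", ""});   // missing amount

            List<String[]> cleaned = new ArrayList<>();
            for (String[] record : rawRecords) {
                if (isComplete(record)) {
                    cleaned.add(record);           // keep only complete records
                }
            }
            System.out.println("Kept " + cleaned.size() + " of " + rawRecords.size() + " records");
        }

        // A record is usable only if every field is present and non-empty.
        private static boolean isComplete(String[] record) {
            for (String field : record) {
                if (field == null || field.trim().isEmpty()) {
                    return false;
                }
            }
            return true;
        }
    }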


One of the factors in the reliability of services is authentication, which decides who can access which services. Since big data offers a wide variety of services, authentication becomes one of the main criteria for consideration. This chapter outlines the features of the security services in terms of the requirements and the issues in business services. It also gives some background on services in the cloud and the interaction between clients and cloud services, with an emphasis on security services. The authentication procedure with Kerberos SPNEGO, the authentication protocol offered as a security service in Hadoop, is introduced. The configuration details for a typical browser (Mozilla Firefox) are detailed, and the usage of the Linux command curl is introduced. The kinit command, which obtains tickets from the key distribution center, is outlined, and the procedure for accessing the server from Java code is given. A section on server-side configuration discusses the Maven repositories (local, central, and remote) that hold the necessary library JAR files. The configuration is explained with a typical XML file, and the usage of Simple Logging Facade for Java (SLF4J) is introduced. The configuration has many parameters, and their names and values are tabulated for clarity. The use of an LDAP (Lightweight Directory Access Protocol) server is introduced, and the provision for multi-scheme configuration is outlined with an example configuration file. The facilities for advanced security features using the signer secret provider are highlighted with appropriate examples of parameter names and values.
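
As a concrete illustration of accessing a SPNEGO-protected Hadoop web endpoint from Java, the sketch below uses Hadoop's AuthenticatedURL client class from the hadoop-auth module. The host name, port, and path are placeholders, and a valid Kerberos ticket (obtained beforehand with kinit) is assumed to be in the local ticket cache; the chapter's own code may differ.

    // Minimal SPNEGO client sketch using Hadoop's AuthenticatedURL.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.apache.hadoop.security.authentication.client.AuthenticatedURL;

    public class SpnegoClientSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder WebHDFS URL on a Kerberos-secured cluster.
            URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/tmp?op=LISTSTATUS");

            AuthenticatedURL.Token token = new AuthenticatedURL.Token();
            HttpURLConnection conn = new AuthenticatedURL().openConnection(url, token);

            System.out.println("HTTP status: " + conn.getResponseCode());
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);   // JSON listing of /tmp
                }
            }
        }
    }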


The Hadoop Distributed File System, popularly known as HDFS, is a Java-based distributed file system that runs on commodity machines. HDFS is meant for storing big data over distributed commodity machines and getting the work done faster by processing the data in a distributed manner. Basically, HDFS has one name node (master node) and a cluster of data nodes (slave nodes). HDFS files are divided into blocks; a block is the basic unit of data (64 MB by default) that can be read or written. The functions of the name node are to manage the slave nodes, maintain the file system namespace, control client access, and control replication. To ensure the availability of the name node, a standby name node is deployed through failover control, and fencing is applied so that the previously active name node cannot interfere during failover. The functions of the data nodes are to store the data, serve read and write requests, replicate blocks, report the liveness of the node, enforce the storage policy, and maintain the block cache; together, they ensure the availability of the data.
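
For illustration, the following minimal sketch writes and reads a file in HDFS through the Java FileSystem API. The name node address and path are placeholders; in practice the cluster configuration would usually come from core-site.xml on the classpath.

    // Minimal HDFS read/write sketch using the FileSystem API.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/tmp/hello.txt");

                // Write: the name node records the metadata, the blocks go to data nodes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
                }

                // Read the file back through the same API.
                try (BufferedReader in =
                         new BufferedReader(new InputStreamReader(fs.open(file)))) {
                    System.out.println(in.readLine());
                }
            }
        }
    }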


As the name indicates, this chapter explains the evolution of Hadoop. Doug Cutting started a text search library called Lucene. After it moved to the Apache Software Foundation, he extended it into a web crawler project called Apache Nutch. The Google File System was then taken as a reference and adapted into the Nutch Distributed File System; Google's MapReduce model was integrated as well, and Hadoop was formed. The whole path from Lucene to Apache Hadoop is illustrated in this chapter, and the different versions of Hadoop are explained. The procedure to download the software and the mechanism to verify the downloaded software are shown. The architecture of Hadoop is then detailed: a Hadoop cluster is a set of commodity machines grouped together, and the arrangement of Hadoop machines in different racks is shown. After reading this chapter, the reader will understand how Hadoop has evolved and its overall architecture.


As the name indicates, this chapter explains the various additional tools provided with the Hadoop distribution: Hadoop Streaming, Hadoop Archives, DistCp, Rumen, GridMix, and the Scheduler Load Simulator. Hadoop Streaming is a utility that allows the user to supply any executable or script as both mapper and reducer. Hadoop Archives is used for archiving old files and directories. DistCp copies files within a cluster and also across different clusters. Rumen extracts meaningful data from JobHistory files and analyzes it; it is used for statistical analysis. GridMix is a benchmark for Hadoop: it takes a job trace and creates synthetic jobs with the same pattern as the trace, and the trace can be generated with the Rumen tool. The Scheduler Load Simulator is a tool for simulating different loads and scheduling methods such as FIFO, the Fair Scheduler, etc. This chapter explains all the tools and gives the syntax of the various commands for each tool. After reading this chapter, the reader will be able to use all these tools effectively.
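
Beyond the command line, DistCp can also be launched programmatically. The sketch below assumes the Hadoop 2.x DistCp Java API (in Hadoop 3 the options class moved to a builder style); the cluster addresses and paths are placeholders.

    // Sketch: launching DistCp from Java instead of "hadoop distcp ...".
    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class DistCpSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Equivalent of: hadoop distcp hdfs://nn1:8020/data/logs hdfs://nn2:8020/backup/logs
            DistCpOptions options = new DistCpOptions(
                    Collections.singletonList(new Path("hdfs://nn1:8020/data/logs")),
                    new Path("hdfs://nn2:8020/backup/logs"));

            DistCp distCp = new DistCp(conf, options);
            Job job = distCp.execute();   // runs the copy as a MapReduce job (blocking by default)
            System.out.println("DistCp successful: " + job.isSuccessful());
        }
    }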


Apache Hadoop includes Java APIs for functions on an HDFS file system such as creating a file, renaming, deleting, and setting read-write permissions for directories; this can be done on a single system or a cluster of systems. In addition, REST (REpresentational State Transfer) APIs are a collection of web services that provide interoperability between a single system and an interconnected distributed network. REST is chosen for its speed, scalability, simplicity, and reliability. The YARN REST and MapReduce REST APIs are briefly discussed in this chapter. The YARN web service REST API includes URI resources through which cluster information, nodes, and application information can be accessed; it comprises the Resource Manager, Node Manager, and Timeline REST APIs. Each resource is accessed with an HTTP request, and the response can be in XML or JSON. The request URI, response status, header, and body are described in their actual format. Similarly, the MapReduce REST API exposes details about running jobs, such as the number of tasks, counters, and attempts. Hence, the REST APIs on YARN and the Resource Manager return small modules as a response when a resource is requested. An outline of the research on and growth of REST APIs is included in this chapter.
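
As a small illustration, the sketch below queries the Resource Manager's cluster-information resource over HTTP. The host name is a placeholder, 8088 is the usual default Resource Manager web port, and the Accept header requests JSON (XML could be requested instead).

    // Minimal client for the YARN Resource Manager REST API.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class YarnRestSketch {
        public static void main(String[] args) throws Exception {
            // GET /ws/v1/cluster/info returns general cluster information.
            URL url = new URL("http://resourcemanager.example.com:8088/ws/v1/cluster/info");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Accept", "application/json");

            System.out.println("Response status: " + conn.getResponseCode());
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);   // JSON body with cluster id, state, version, ...
                }
            }
        }
    }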


Apache Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource management technology in Hadoop version 2. YARN supports both batch and real-time processing across clusters and improves the utilization of cluster resources through dynamic allocation. The chapter presents the YARN architecture, schedulers, Resource Manager phases, YARN applications, commands, and the Timeline Server. The YARN architecture splits the work into resource management and job scheduling, handled by the Resource Manager and the Node Managers. The chapter also addresses the Timeline Server, which stores and retrieves current and historic application information in a generic way.
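
To make the Resource Manager's role concrete, the following sketch queries it through the YarnClient API for the running nodes and known applications. It assumes the cluster configuration (yarn-site.xml) is available on the classpath; otherwise the Resource Manager address would have to be set explicitly.

    // Sketch: querying the Resource Manager with YarnClient.
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new YarnConfiguration();

            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Node Managers currently registered with the Resource Manager.
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            System.out.println("Running nodes: " + nodes.size());

            // Applications known to the Resource Manager.
            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                System.out.println(app.getApplicationId() + " -> " + app.getYarnApplicationState());
            }

            yarnClient.stop();
        }
    }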


Apache Hadoop is an open source framework for storing and processing massive amounts of data; its skeleton can be viewed as distributed computing across a cluster of computers. This chapter deals with the single-node and multinode setup of the Hadoop environment, along with the Hadoop user commands and administration commands. Hadoop processes the data on a cluster of machines built from commodity hardware. It has two components: the Hadoop Distributed File System for storage and MapReduce/YARN for processing. Single-node processing can be done in standalone or pseudo-distributed mode, whereas multinode processing uses the cluster mode; the execution procedure for each environment is briefly stated. The chapter then explores the Hadoop user commands for operations such as copying files to and from the distributed file system, running a jar, creating an archive, checking the version and the classpath, etc. Further, Hadoop administration manages the configuration, including functions such as balancing the cluster, running the dfs and MapReduce admin commands, and managing the namenode and secondary namenode.
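
For illustration, the same file system user commands can be driven from Java through FsShell and ToolRunner, as in the sketch below. The paths are placeholders, and the cluster configuration (core-site.xml, etc.) is assumed to be on the classpath.

    // Sketch: running "hadoop fs" style commands from Java via FsShell.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FsShell;
    import org.apache.hadoop.util.ToolRunner;

    public class HadoopCommandSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Equivalent of: hadoop fs -mkdir -p /user/demo
            int rc1 = ToolRunner.run(conf, new FsShell(),
                    new String[] {"-mkdir", "-p", "/user/demo"});

            // Equivalent of: hadoop fs -copyFromLocal data.csv /user/demo
            int rc2 = ToolRunner.run(conf, new FsShell(),
                    new String[] {"-copyFromLocal", "data.csv", "/user/demo"});

            // Equivalent of: hadoop fs -ls /user/demo
            int rc3 = ToolRunner.run(conf, new FsShell(),
                    new String[] {"-ls", "/user/demo"});

            System.out.println("exit codes: " + rc1 + ", " + rc2 + ", " + rc3);
        }
    }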


Big data is now a reality. Data is created constantly: data from mobile phones, social media, GIS, imaging technologies for medical diagnosis, and more must all be stored for some purpose, and much of it needs to be stored and processed in real time. The challenging task is to store this vast amount of data and manage it with meaningful patterns, something traditional data structures were not designed for. Data sources are expected to grow about 50X in the next 10 years. An International Data Corporation (IDC) forecast sees the big data technology and services market growing at a compound annual growth rate (CAGR) of 23.1% over the 2014-2019 period, with annual spending reaching $48.6 billion in 2019. The digital universe is expected to double in size every two years, reaching about 44 zettabytes, or 44 trillion gigabytes, by 2020; a zettabyte is 10^21 bytes. There is a need to design new data architectures with new analytical sandboxes and methods, and to integrate multiple skills so that data scientists can operate on such large data.


The second major component of Hadoop is MapReduce, the processing framework of the Hadoop environment. It consists of a single resource manager, one node manager per node, and one application master per application; these are responsible for allocating the necessary resources and executing the jobs submitted by clients. The entire process of executing a job is narrated in this chapter, and the architecture of the MapReduce framework is explained. Execution is carried out through two major operations, map and reduce, which are demonstrated with an example. The syntax of the different user interfaces available is shown, and the coding for MapReduce programming is shown using Java. The entire cycle of job execution is covered. After reading this chapter, the reader will be able to write MapReduce programs and execute them. At the end of the chapter, some research issues in MapReduce programming are outlined.
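
As a concrete example of Java MapReduce coding, the sketch below is the classic word-count program written against the org.apache.hadoop.mapreduce API. It is a standard illustration rather than the chapter's exact code; the input and output paths are taken from the command line.

    // Classic word count: map emits (word, 1), reduce sums the counts.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in a line of input.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws java.io.IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws java.io.IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged as a jar, it would typically be submitted with a command of the form: hadoop jar wordcount.jar WordCount <input dir> <output dir>.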

