Implementation of Big-Data Applications Using Map Reduce Framework

Clustering As a result of the rapid development in cloud computing, it & fundamental to investigate the performance of extraordinary Hadoop MapReduce purposes and to realize the performance bottleneck in a cloud cluster that contributes to higher or diminish performance. It is usually primary to research the underlying hardware in cloud cluster servers to permit the optimization of program and hardware to achieve the highest performance feasible. Hadoop is founded on MapReduce, which is among the most popular programming items for huge knowledge analysis in a parallel computing environment. In this paper, we reward a particular efficiency analysis, characterization, and evaluation of Hadoop MapReduce Word Count utility. The main aim of this paper is to give implements of Hadoop map-reduce programming by giving a hands-on experience in developing Hadoop based Word-Count and Apriori application. Word count problem using Hadoop Map Reduce framework. The Apriori Algorithm has been used for finding frequent item set using Map Reduce framework.

Download Full-text

A Survey on Implementation of Word-Count with Map Reduce Programming Oriented Model using Hadoop Framework

SSRN Electronic Journal ◽

10.2139/ssrn.3351074 ◽

2019 ◽

Author(s):

Santosh Yadav ◽

Jay Prakash

Keyword(s):

Map Reduce ◽

Word Count ◽

Hadoop Framework

Download Full-text

Cloud-based intelligent self-diagnosis and department recommendation service using Chinese medical BERT

Journal of Cloud Computing Advances Systems and Applications ◽

10.1186/s13677-020-00218-2 ◽

2021 ◽

Vol 10 (1) ◽

Author(s):

Junshu Wang ◽

Guoming Zhang ◽

Wei Wang ◽

Ka Zhang ◽

Yehua Sheng

Keyword(s):

Cloud Computing ◽

Large Scale ◽

Medical Service ◽

Rapid Development ◽

Medical Knowledge ◽

Language Models ◽

Computing Environment ◽

Computing Power ◽

Cloud Computing Environment ◽

Proposed Model

AbstractWith the rapid development of hospital informatization and Internet medical service in recent years, most hospitals have launched online hospital appointment registration systems to remove patient queues and improve the efficiency of medical services. However, most of the patients lack professional medical knowledge and have no idea of how to choose department when registering. To instruct the patients to seek medical care and register effectively, we proposed CIDRS, an intelligent self-diagnosis and department recommendation framework based on Chinese medical Bidirectional Encoder Representations from Transformers (BERT) in the cloud computing environment. We also established a Chinese BERT model (CHMBERT) trained on a large-scale Chinese medical text corpus. This model was used to optimize self-diagnosis and department recommendation tasks. To solve the limited computing power of terminals, we deployed the proposed framework in a cloud computing environment based on container and micro-service technologies. Real-world medical datasets from hospitals were used in the experiments, and results showed that the proposed model was superior to the traditional deep learning models and other pre-trained language models in terms of performance.

Download Full-text

Computational storage: an efficient and scalable platform for big data and HPC applications

Journal Of Big Data ◽

10.1186/s40537-019-0265-5 ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 2

Author(s):

Mahdi Torabzadehkashi ◽

Siavash Rezaei ◽

Ali HeydariGorji ◽

Hosein Bobarshad ◽

Vladimir Alves ◽

...

Keyword(s):

Big Data ◽

High Performance ◽

Distributed Processing ◽

Data Access ◽

Distributed Applications ◽

Process Data ◽

Storage Devices ◽

Hadoop Mapreduce ◽

Big Data Applications ◽

Application Processor

AbstractIn the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data to the system. This cost has a direct relation with the distance of processing engines from the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move process closer to data. Computational storage devices (CSDs) push the “move process to data” paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform, that provides a seamless environment to process data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for the applications. Thus, a vast spectrum of applications can be ported for running on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina CSD is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place without any modifications on the underlying distributed processing framework. For the proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks to investigate the benefits of deploying Catalina CSDs in the distributed processing environments. The experimental results show up to 2.2× improvement in performance and 4.3× reduction in energy consumption, respectively, for running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved up to 5.4× and 8.9×, respectively.

Download Full-text

The Design and Evaluation of a Hands-on Course on Cloud Computing Environment

2021 IEEE Global Engineering Education Conference (EDUCON) ◽

10.1109/educon46332.2021.9454116 ◽

2021 ◽

Author(s):

Muhammad WANNOUS

Keyword(s):

Cloud Computing ◽

Computing Environment ◽

Cloud Computing Environment ◽

Hands On

Download Full-text

Big Data

Security, Privacy, and Forensics Issues in Big Data - Advances in Information Security, Privacy, and Ethics ◽

10.4018/978-1-5225-9742-1.ch002 ◽

2020 ◽

pp. 24-65

Author(s):

P. Lalitha Surya Kumari

Keyword(s):

Big Data ◽

Life Cycle ◽

Data Collection ◽

Operating Systems ◽

Data Security ◽

Map Reduce ◽

Security Issues ◽

Big Data Applications

This chapter gives information about the most important aspects in how computing infrastructures should be configured and intelligently managed to fulfill the most notably security aspects required by big data applications. Big data is one area where we can store, extract, and process a large amount of data. All these data are very often unstructured. Using big data, security functions are required to work over the heterogeneous composition of diverse hardware, operating systems, and network domains. A clearly defined security boundary like firewalls and demilitarized zones (DMZs), conventional security solutions, are not effective for big data as it expands with the help of public clouds. This chapter discusses the different concepts like characteristics, risks, life cycle, and data collection of big data, map reduce components, issues and challenges in big data, cloud secure alliance, approaches to solve security issues, introduction of cybercrime, YARN, and Hadoop components.

Download Full-text

Predicting Short-Term Electricity Demand by Combining the Advantages of ARMA and XGBoost in Fog Computing Environment

Wireless Communications and Mobile Computing ◽

10.1155/2018/5018053 ◽

2018 ◽

Vol 2018 ◽

pp. 1-18 ◽

Cited By ~ 14

Author(s):

Chuanbin Li ◽

Xiaosen Zheng ◽

Zikun Yang ◽

Li Kuang

Keyword(s):

Clustering Algorithm ◽

Rapid Development ◽

Fog Computing ◽

Electricity Consumption ◽

Electricity Demand ◽

Historical Records ◽

Computing Environment ◽

Classical Models ◽

Cloud Framework ◽

Computing Framework

With the rapid development of IoT, the disadvantages of Cloud framework have been exposed, such as high latency, network congestion, and low reliability. Therefore, the Fog Computing framework has emerged, with an extended Fog Layer between the Cloud and terminals. In order to address the real-time prediction on electricity demand, we propose an approach based on XGBoost and ARMA in Fog Computing environment. By taking the advantages of Fog Computing framework, we first propose a prototype-based clustering algorithm to divide enterprise users into several categories based on their total electricity consumption; we then propose a model selection approach by analyzing users’ historical records of electricity consumption and identifying the most important features. Generally speaking, if the historical records pass the test of stationarity and white noise, ARMA is used to model the user’s electricity consumption in time sequence; otherwise, if the historical records do not pass the test, and some discrete features are the most important, such as weather and whether it is weekend, XGBoost will be used. The experiment results show that our proposed approach by combining the advantage of ARMA and XGBoost is more accurate than the classical models.

Download Full-text

The Research of PApriori Algrithm Based on OLAP

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.532-533.1645 ◽

2012 ◽

Vol 532-533 ◽

pp. 1645-1648

Author(s):

Ya Long Ma ◽

Guo Qing Zhang ◽

Wen Chao Xu ◽

Lan Juan Tong

Keyword(s):

Complex System ◽

Association Rules ◽

Efficiency Analysis ◽

Apriori Algorithm ◽

Massive Number

In analysis of association rules based on OLAP, the application of Apriori algorithm may need to scan the database frequently and generate massive number of candidate itemsets because of the complex system. According to the weakness, the paper proposes PApriori algorithm based on pretreatment, which reduces the time of scanning database to once, while not directly generate candidate itemsets, raise the efficiency Analysis of association rules based on OLAP efficiently.

Download Full-text

Applying compression algorithms on hadoop cluster implementing through apache tez and hadoop mapreduce

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.26.12539 ◽

2018 ◽

Vol 7 (2.26) ◽

pp. 80

Author(s):

Dr E. Laxmi Lydia ◽

M Srinivasa Rao

Keyword(s):

Big Data ◽

Execution Time ◽

Big Data Analytics ◽

Research Area ◽

Word Count ◽

Hadoop Mapreduce ◽

Interactive Query ◽

Hadoop Distributed File System ◽

Hadoop Cluster ◽

Compressed Data

The latest and famous subject all over the cloud research area is Big Data; its main appearances are volume, velocity and variety. The characteristics are difficult to manage through traditional software and their various available methodologies. To manage the data which is occurring from various domains of big data are handled through Hadoop, which is open framework software which is mainly developed to provide solutions. Handling of big data analytics is done through Hadoop Map Reduce framework and it is the key engine of hadoop cluster and it is extensively used in these days. It uses batch processing system.Apache developed an engine named "Tez", which supports interactive query system and it won't writes any temporary data into the Hadoop Distributed File System(HDFS).The paper mainly focuses on performance juxtaposition of MapReduce and TeZ, performance of these two engines are examined through the compression of input files and map output files. To compare two engines we used Bzip compression algorithm for the input files and snappy for the map out files. Word Count and Terasort gauge are used on our experiments. For the Word Count gauge, the results shown that Tez engine has better execution time than Hadoop MapReduce engine for the both compressed and non-compressed data. It has reduced the execution time nearly 39% comparing to the execution time of the Hadoop MapReduce engine. Correspondingly for the terasort gauge, the Tez engine has higher execution time than Hadoop MapReduce engine.

Download Full-text