hadoop mapreduce Latest Research Papers

Comparative Analysis of Hadoop MapReduce and Spark Based on People’s Livelihood Appeal Data

Big Data - Communications in Computer and Information Science ◽

10.1007/978-981-16-9709-8_6 ◽

2022 ◽

pp. 71-91

Author(s):

Lixin Liang ◽

Heng Zhao ◽

Yongan Shen

Keyword(s):

Comparative Analysis ◽

Hadoop Mapreduce

Statistical Visualization of Big Data Through Hadoop Streaming in RStudio

10.4018/978-1-6684-3662-2.ch035 ◽

2022 ◽

pp. 758-787

Author(s):

Chitresh Verma ◽

Rajiv Pandey

Keyword(s):

Big Data ◽

Data Visualization ◽

Data Analytics ◽

Big Data Analytics ◽

Data Streaming ◽

Data Set ◽

Graphical Modeling ◽

Hadoop Mapreduce ◽

R Packages ◽

Case Based

Data Visualization enables visual representation of the data set for interpretation of data in a meaningful manner from human perspective. The Statistical visualization calls for various tools, algorithms and techniques that can support and render graphical modeling. This chapter shall explore on the detailed features R and RStudio. The combination of Hadoop and R for the Big Data Analytics and its data visualization shall be demonstrated through appropriate code snippets. The integration perspective of R and Hadoop is explained in detail with the help of a utility called Hadoop streaming jar. The various R packages and their integration with Hadoop operations in the R environment are explained through suitable examples. The process of data streaming is provided using different readers of Hadoop streaming package. A case based statistical project is considered in which the data set is visualized after dual execution using the Hadoop MapReduce and R script.

Exploring Big Data Analytic Approaches to Cancer Blog Text Analysis

10.4018/978-1-6684-3662-2.ch090 ◽

2022 ◽

pp. 1843-1863

Author(s):

Viju Raghupathi ◽

Yilu Zhou ◽

Wullianallur Raghupathi

Keyword(s):

Big Data ◽

Data Analytics ◽

Word Association ◽

Big Data Analytics ◽

Text Analytics ◽

Hadoop Mapreduce ◽

Cancer Management ◽

Exploratory Approach ◽

Clustering And Classification ◽

Data Analytic

In this article, the authors explore the potential of a big data analytics approach to unstructured text analytics of cancer blogs. The application is developed using Cloudera platform's Hadoop MapReduce framework. It uses several text analytics algorithms, including word count, word association, clustering, and classification, to identify and analyze the patterns and keywords in cancer blog postings. This article establishes an exploratory approach to involving big data analytics methods in developing text analytics applications for the analysis of cancer blogs. Additional insights are extracted through various means, including the development of categories or keywords contained in the blogs, the development of a taxonomy, and the examination of relationships among the categories. The application has the potential for generalizability and implementation with health content in other blogs and social media. It can provide insight and decision support for cancer management and facilitate efficient and relevant searches for information related to cancer.

High-performance computing for SARS-CoV-2 RNAs clustering: a data science‒based genomics approach

Genomics & Informatics ◽

10.5808/gi.21056 ◽

2021 ◽

Vol 19 (4) ◽

pp. e49

Author(s):

Anas Oujja ◽

Mohamed Riduan Abid ◽

Jaouad Boumhidi ◽

Safae Bourhnane ◽

Asmaa Mourhir ◽

...

Keyword(s):

High Performance Computing ◽

High Performance ◽

Data Science ◽

Longest Common Subsequence ◽

Rna Sequences ◽

Hadoop Mapreduce ◽

Common Subsequence ◽

Ict Tools ◽

Clustering Approach ◽

Performance Computing

Nowadays, Genomic data constitutes one of the fastest growing datasets in the world. As of 2025, it is supposed to become the fourth largest source of Big Data, and thus mandating adequate high-performance computing (HPC) platform for processing. With the latest unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community is in crucial need for ICT tools to process SARS-CoV-2 RNA data, e.g., by classifying it (i.e., clustering) and thus assisting in tracking virus mutations and predict future ones. In this paper, we are presenting an HPC-based SARS-CoV-2 RNAs clustering tool. We are adopting a data science approach, from data collection, through analysis, to visualization. In the analysis step, we present how our clustering approach leverages on HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm in order to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The latter are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to work out existing clusters. In addition to that, we present a comparative study of the LCS algorithm performance based on variable workloads and different numbers of Hadoop worker nodes.

RDD-Eclat: Approaches to Parallelize Eclat Algorithm on Spark RDD Framework (Extended Version)

10.21203/rs.3.rs-1079576/v1 ◽

2021 ◽

Author(s):

Pankaj Singh ◽

Sudhakar Singh ◽

P K Mishra ◽

Rakhi Garg

Keyword(s):

Data Processing ◽

Iterative Algorithms ◽

Frequent Itemset ◽

Experimental Results ◽

Distributed Data ◽

Data Intensive ◽

Hadoop Mapreduce ◽

Distributed Data Processing ◽

Benchmark Datasets ◽

Processing Framework

Abstract Frequent itemset mining (FIM) is a highly computational and data intensive algorithm. Therefore, parallel and distributed FIM algorithms have been designed to process large volume of data in a reduced time. Recently, a number of FIM algorithms have been designed on Hadoop MapReduce, a distributed big data processing framework. But, due to heavy disk I/O, MapReduce is found to be inefficient for the highly iterative FIM algorithms. Therefore, Spark, a more efficient distributed data processing framework, has been developed with in-memory computation and resilient distributed dataset (RDD) features to support the iterative algorithms. On this framework, Apriori and FP-Growth based FIM algorithms have been designed on the Spark RDD framework, but Eclat-based algorithm has not been explored yet. In this paper, RDD-Eclat, a parallel Eclat algorithm on the Spark RDD framework is proposed with its five variants. The proposed algorithms are evaluated on the various benchmark datasets, and the experimental results show that RDD-Eclat outperforms the Spark-based Apriori by many times. Also, the experimental results show the scalability of the proposed algorithms on increasing the number of cores and size of the dataset.

An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering

Computational and Mathematical Methods in Medicine ◽

10.1155/2021/7937573 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Meijing Li ◽

Tianjie Chen ◽

Keun Ho Ryu ◽

Cheng Hao Jin

Keyword(s):

Semantic Similarity ◽

Clustering Algorithms ◽

Semantic Features ◽

Text Data ◽

Semantic Similarity Measure ◽

Document Similarity ◽

Hadoop Mapreduce ◽

Semantic Mining ◽

Semantic Similarity Measurement ◽

Open Datasets

Semantic mining is always a challenge for big biomedical text data. Ontology has been widely proved and used to extract semantic information. However, the process of ontology-based semantic similarity calculation is so complex that it cannot measure the similarity for big text data. To solve this problem, we propose a parallelized semantic similarity measurement method based on Hadoop MapReduce for big text data. At first, we preprocess and extract the semantic features from documents. Then, we calculate the document semantic similarity based on ontology network structure under MapReduce framework. Finally, based on the generated semantic document similarity, document clusters are generated via clustering algorithms. To validate the effectiveness, we use two kinds of open datasets. The experimental results show that the traditional methods can hardly work for more than ten thousand biomedical documents. The proposed method keeps efficient and accurate for big dataset and is of high parallelism and scalability.

MuSe: a multi-level storage scheme for big RDF data using MapReduce

Journal Of Big Data ◽

10.1186/s40537-021-00519-6 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Tanvi Chawla ◽

Girdhari Singh ◽

Emmanuel S. Pilli

Keyword(s):

Linked Data ◽

State Of The Art ◽

Query Execution ◽

Hadoop Mapreduce ◽

Storage Scheme ◽

The Common ◽

Multi Level ◽

Rdf Data ◽

Description Framework ◽

Resource Description

AbstractResource Description Framework (RDF) model owing to its flexible structure is increasingly being used to represent Linked data. The rise in amount of Linked data and Knowledge graphs has resulted in an increase in the volume of RDF data. RDF is used to model metadata especially for social media domains where the data is linked. With the plethora of RDF data sources available on the Web, scalable RDF data management becomes a tedious task. In this paper, we present MuSe—an efficient distributed RDF storage scheme for storing and querying RDF data with Hadoop MapReduce. In MuSe, the Big RDF data is stored at two levels for answering the common triple patterns in SPARQL queries. MuSe considers the type of frequently occuring triple patterns and optimizes RDF storage to answer such triple patterns in minimum time. It accesses only the tables that are sufficient for answering a triple pattern instead of scanning the whole RDF dataset. The extensive experiments on two synthetic RDF datasets i.e. LUBM and WatDiv, show that MuSe outperforms the compared state-of-the art frameworks in terms of query execution time and scalability.

Hybrid Approach for Enhancing Performance of Genomic Data for Stream Matching

International Journal of Cognitive Informatics and Natural Intelligence ◽

10.4018/ijcini.20211001.oa38 ◽

2021 ◽

Vol 15 (4) ◽

pp. 1-18

Author(s):

Gururaj T. ◽

Siddesh G. M.

Keyword(s):

Processing Time ◽

Gene Expression Analysis ◽

Gene Sequence ◽

Hybrid Approach ◽

Basic Operation ◽

Sequence Matching ◽

Hadoop Mapreduce ◽

Proposed Model ◽

Performance Issues ◽

Memory Utilization

In gene expression analysis, the expression levels of thousands of genes are analyzed, such as separate stages of treatments or diseases. Identifying particular gene sequence pattern is a challenging task with respect to performance issues. The proposed solution addresses the performance issues in genomic stream matching by involving assembly and sequencing. Counting the k-mer based on k-input value and while performing DNA sequencing tasks, the researches need to concentrate on sequence matching. The proposed solution addresses performance issue metrics such as processing time for k-mer counting, number of operations for matching similarity, memory utilization while performing similarity search, and processing time for stream matching. By suggesting an improved algorithm, Revised Rabin Karp(RRK) for basic operation and also to achieve more efficiency, the proposed solution suggests a novel framework based on Hadoop MapReduce blended with Pig & Apache Tez. The measure of memory utilization and processing time proposed model proves its efficiency when compared to existing approaches.

Hybrid Approach for Enhancing Performance of Genomic Data for Stream Matching

International Journal of Cognitive Informatics and Natural Intelligence ◽

10.4018/ijcini.20211001oa25 ◽

2021 ◽

Vol 15 (4) ◽

pp. 0-0

Keyword(s):

Processing Time ◽

Gene Expression Analysis ◽

Gene Sequence ◽

Hybrid Approach ◽

Basic Operation ◽

Sequence Matching ◽

Hadoop Mapreduce ◽

Proposed Model ◽

Performance Issues ◽

Memory Utilization

In gene expression analysis, the expression levels of thousands of genes are analyzed, such as separate stages of treatments or diseases. Identifying particular gene sequence pattern is a challenging task with respect to performance issues. The proposed solution addresses the performance issues in genomic stream matching by involving assembly and sequencing. Counting the k-mer based on k-input value and while performing DNA sequencing tasks, the researches need to concentrate on sequence matching. The proposed solution addresses performance issue metrics such as processing time for k-mer counting, number of operations for matching similarity, memory utilization while performing similarity search, and processing time for stream matching. By suggesting an improved algorithm, Revised Rabin Karp(RRK) for basic operation and also to achieve more efficiency, the proposed solution suggests a novel framework based on Hadoop MapReduce blended with Pig & Apache Tez. The measure of memory utilization and processing time proposed model proves its efficiency when compared to existing approaches.

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

10.1145/3481646.3481649 ◽

2021 ◽

Author(s):

Taha Tekdogan ◽

Ali Cakmak

Keyword(s):

Big Data ◽

Data Classification ◽

Apache Spark ◽

Hadoop Mapreduce ◽

Big Data Classification

hadoop mapreduce
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

Comparative Analysis of Hadoop MapReduce and Spark Based on People’s Livelihood Appeal Data

Statistical Visualization of Big Data Through Hadoop Streaming in RStudio

Exploring Big Data Analytic Approaches to Cancer Blog Text Analysis

High-performance computing for SARS-CoV-2 RNAs clustering: a data science‒based genomics approach

RDD-Eclat: Approaches to Parallelize Eclat Algorithm on Spark RDD Framework (Extended Version)

An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering

MuSe: a multi-level storage scheme for big RDF data using MapReduce

Hybrid Approach for Enhancing Performance of Genomic Data for Stream Matching

Hybrid Approach for Enhancing Performance of Genomic Data for Stream Matching

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

Export Citation Format

hadoop mapreduceRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

Comparative Analysis of Hadoop MapReduce and Spark Based on People’s Livelihood Appeal Data

Statistical Visualization of Big Data Through Hadoop Streaming in RStudio

Exploring Big Data Analytic Approaches to Cancer Blog Text Analysis

High-performance computing for SARS-CoV-2 RNAs clustering: a data science‒based genomics approach

RDD-Eclat: Approaches to Parallelize Eclat Algorithm on Spark RDD Framework (Extended Version)

An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering

MuSe: a multi-level storage scheme for big RDF data using MapReduce

Hybrid Approach for Enhancing Performance of Genomic Data for Stream Matching

Hybrid Approach for Enhancing Performance of Genomic Data for Stream Matching

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

hadoop mapreduce
Recently Published Documents