Hadoop Clusters
Recently Published Documents


TOTAL DOCUMENTS: 134 (FIVE YEARS 35)

H-INDEX: 15 (FIVE YEARS 2)

2021
Author(s): Nathalie Lehmann, Sandrine Perrin, Claire Wallon, Xavier Bauquet, Vivien Deshaies, ...

Motivation: Core sequencing facilities produce huge amounts of sequencing data that need to be analysed with automated workflows to ensure reproducibility and traceability. Eoulsan is a versatile open-source workflow engine meeting the needs of core facilities by automating the analysis of a large number of samples. Its core design separates the description of the workflow from the actual commands to be run. This design simplifies usage, as the user does not need to write code, while ensuring reproducibility. Eoulsan was initially developed for bulk RNA-seq data, but transcriptomics applications have recently widened with the advent of long-read sequencing and single-cell technologies, calling for the development of new workflows. Result: We present Eoulsan 2, a major update that (i) enhances the workflow manager itself, (ii) facilitates the development of new modules, and (iii) expands its applications to long-read RNA-seq (Oxford Nanopore Technologies) and scRNA-seq (Smart-seq2 and 10x Genomics). The workflow manager has been rewritten, with support for execution on a larger choice of computational infrastructure (workstations, Hadoop clusters, and various job schedulers for cluster usage). Eoulsan now facilitates the development of new modules by reusing wrappers developed for the Galaxy platform, with support for container images (Docker or Singularity) that package the tools to be executed. Finally, Eoulsan natively integrates novel modules for bulk RNA-seq, as well as others specifically designed for processing long-read RNA-seq and scRNA-seq. Eoulsan 2 is distributed with ready-to-use workflows and companion tutorials. Availability and implementation: Eoulsan is implemented in Java, supported on Linux systems, and distributed under the LGPL and CeCILL-C licenses at: http://outils.genomique.biologie.ens.fr/eoulsan/. The source code and sample workflows are available on GitHub: https://github.com/GenomicParisCentre/eoulsan. A GitHub repository for modules using the Galaxy tool XML syntax is further provided at: https://github.com/GenomicParisCentre/eoulsan-tools
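The design point above, a workflow described as data rather than as code the user writes, can be illustrated with a minimal sketch. This is a hypothetical illustration in plain Java, not Eoulsan's actual API; the module names and parameters are made up.

import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch of a declarative workflow core: steps are plain data
// naming a module and its parameters, and the engine resolves and runs the
// modules. This is not Eoulsan's actual API.
public class MiniWorkflowEngine {

  // A step only describes what to run; it contains no executable code.
  public record Step(String module, Map<String, String> params) {}

  private final Map<String, BiConsumer<Map<String, String>, String>> modules;

  public MiniWorkflowEngine(Map<String, BiConsumer<Map<String, String>, String>> modules) {
    this.modules = modules;
  }

  // The engine, not the user, turns the description into actual executions.
  public void run(List<Step> workflow, String sample) {
    for (Step step : workflow) {
      modules.get(step.module()).accept(step.params(), sample);
    }
  }

  public static void main(String[] args) {
    // Illustrative module registry and workflow description (made-up names).
    Map<String, BiConsumer<Map<String, String>, String>> registry = Map.of(
        "filterreads", (params, sample) -> System.out.println("filtering " + sample + " with " + params),
        "mapreads", (params, sample) -> System.out.println("mapping " + sample + " with " + params));
    new MiniWorkflowEngine(registry).run(
        List.of(new Step("filterreads", Map.of("trim.length", "11")),
                new Step("mapreads", Map.of("mapper", "star"))),
        "sample_1.fastq");
  }
}

Because the workflow is data, the same description can be handed to a workstation, a Hadoop cluster, or a job scheduler without the user changing it, which is the portability property the abstract highlights.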


2021
Vol 10 (4), pp. 1-25
Author(s): Sundarakumar M. R., Mahadevan G., Ramasubbareddy Somula, Sankar Sennan, Bharat S. Rawal

Big Data analytics is an approach for extracting information from very large data warehouse systems. It relies on MapReduce and HDFS to compress and cluster high volumes of data. However, data processing in Hadoop clusters takes considerable time for extraction and storage. The proposed system addresses the time delay in the shuffle phase of MapReduce caused by scheduling and sequencing. To improve processing speed, this work uses a Compressed Elastic Search Index (CESI) and a MapReduce-Based Next Generation Sequencing Approach (MRBNGSA). This approach increases the speed of data retrieval from HDFS clusters because only the metadata is stored in HDFS, which requires less memory at runtime than storing the full data. It thereby reduces the CPU utilization and memory allocation of the resource manager in the Hadoop framework and improves data processing speed, so that the overall time delay is reduced and latency is minimized.
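The abstract gives no implementation details, so the following is only a rough sketch of the metadata-only storage idea it describes: an index maps each record identifier to the location of the full record, so a lookup touches a few bytes of metadata and then issues one ranged read instead of scanning the bulk data. The class and field names are assumptions for illustration, not the authors' CESI/MRBNGSA code.

import java.util.HashMap;
import java.util.Map;

// Illustrative only: a tiny metadata index in the spirit of "store only the
// metadata in HDFS, keep the bulk data elsewhere". Not the authors' implementation.
public class MetadataIndex {

  // Location of one record inside a large data file (e.g. a file on HDFS).
  public record Location(String file, long offset, int length) {}

  private final Map<String, Location> index = new HashMap<>();

  public void put(String recordId, Location where) {
    index.put(recordId, where);
  }

  // Returns only the small metadata entry; the caller then reads exactly
  // `length` bytes at `offset` from `file` instead of scanning the dataset.
  public Location locate(String recordId) {
    return index.get(recordId);
  }

  public static void main(String[] args) {
    MetadataIndex idx = new MetadataIndex();
    idx.put("read_000042", new Location("/data/part-00001", 1_048_576L, 512));
    System.out.println(idx.locate("read_000042"));
  }
}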



2021
Vol 8 (1)
Author(s): N. Ahmed, Andre L. C. Barczak, Mohammad A. Rashid, Teo Susnjak

Abstract: This article proposes a new parallel performance model for different workloads of Spark Big Data applications running on Hadoop clusters. The proposed model can predict the runtime for generic workloads as a function of the number of executors, without necessarily knowing how the algorithms were implemented. For a given problem size, it is shown that a model based on serial boundaries for a 2D arrangement of executors can fit the empirical data for various workloads. The empirical data was obtained from a real Hadoop cluster, using Spark and HiBench. The workloads used in this work included WordCount, SVM, Kmeans, PageRank and Graph (Nweight). A particular runtime pattern emerged when adding more executors to run a job: for some workloads, the runtime was longer with more executors added. This phenomenon is predicted by the new model of parallelisation. The resulting equation from the model explains certain performance patterns that fit neither Amdahl's law nor Gustafson's law. The results show that the proposed model achieved the best fit with all workloads and most of the data sizes, using the R-squared metric to assess the goodness of fit to the empirical data. The proposed model has advantages over machine learning models due to its simplicity, requiring a smaller number of experiments to fit the data. This is very useful to practitioners in the area of Big Data because they can predict the runtime of specific applications by analysing the logs. In this work, the model is limited to changes in the number of executors for a fixed problem size.
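The abstract does not reproduce the fitted equation, but its qualitative claim, that runtime can rise again as executors are added, which neither Amdahl's law nor Gustafson's law predicts, can be demonstrated with assumed functional forms. The constants and the c * n overhead term below are illustrative assumptions, not the paper's model.

// Illustrative comparison (assumed forms and constants, not the paper's model):
// an Amdahl-style runtime keeps decreasing with more executors, while adding a
// per-executor coordination cost produces the observed upturn.
public class RuntimeModels {

  // Amdahl-style scaling: T(n) = serial + parallel / n
  static double amdahl(double serial, double parallel, int n) {
    return serial + parallel / n;
  }

  // Overhead-augmented model: T(n) = serial + parallel / n + c * n
  // (the c * n term stands in for per-executor coordination cost)
  static double withOverhead(double serial, double parallel, double c, int n) {
    return serial + parallel / n + c * n;
  }

  public static void main(String[] args) {
    double serial = 20, parallel = 600, c = 1.5; // made-up values in seconds
    for (int n : new int[] {1, 2, 4, 8, 16, 32, 64}) {
      System.out.printf("executors=%2d  amdahl=%6.1f s  withOverhead=%6.1f s%n",
          n, amdahl(serial, parallel, n), withOverhead(serial, parallel, c, n));
    }
  }
}

With these constants the overhead-augmented runtime is lowest around 20 executors and grows again beyond that, the same shape as the "longer runtime with more executors" pattern reported above, whereas the Amdahl-style curve only flattens.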


2021
Vol 9 (2), pp. 589-604
Author(s): Pruthvi Raj Venkatesh, et al.

Oil industries generate an enormous volume of digitized data (e.g., seismic data) as part of their seismic studies and move it to the cloud for downstream applications. Moving massive data into the cloud can pose many challenges, especially for commercial off-the-shelf geoscience applications, as they require very high compute and disk throughput. This paper proposes a digital transformation framework for efficient seismic data processing and storage comprising: (a) novel data storage options, (b) a cloud-based HPC framework for efficient seismic data processing, and (c) MD5 hash calculation using the MapReduce pattern with Hadoop clusters. The Azure cloud platform is used to validate the proposed framework and compare it with the existing process. Experimental results show a significant improvement in execution time, throughput, efficiency, and cost. The proposed framework can be used in any domain that deals with extensive data requiring high compute and throughput.
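The MD5-with-MapReduce step is not detailed in the abstract, so the mapper below is only a plausible sketch: it assumes an input format that hands each file to a map task as a single (filename, contents) record (Hadoop does not ship such a format; a custom whole-file InputFormat would be needed) and emits one hex digest per file. The driver and job wiring are omitted.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch, not the paper's code: emits one MD5 digest per input file.
public class Md5Mapper extends Mapper<Text, BytesWritable, Text, Text> {

  @Override
  protected void map(Text fileName, BytesWritable contents, Context ctx)
      throws java.io.IOException, InterruptedException {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      // getBytes() may return a padded backing array, so hash only getLength() bytes.
      md5.update(contents.getBytes(), 0, contents.getLength());
      StringBuilder hex = new StringBuilder();
      for (byte b : md5.digest()) {
        hex.append(String.format("%02x", b));
      }
      ctx.write(fileName, new Text(hex.toString()));
    } catch (NoSuchAlgorithmException e) {
      throw new java.io.IOException(e);
    }
  }
}

Hashing whole files inside each map task keeps the digests exact: MD5 is a sequential hash, so per-chunk digests computed in parallel could not simply be combined into the digest of the original file.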


2020
Vol 29 (16), pp. 2050254
Author(s): Tao Li, Shuibing He, Ping Chen, Siling Yang, Yanlong Yin, ...

As one of the most popular frameworks for large-scale analytics processing, Hadoop faces two challenges: both applications and storage devices are becoming heterogeneous. However, existing data placement and job scheduling schemes pay little attention to this heterogeneity in either application I/O requirements or I/O device capability, and can therefore greatly degrade system efficiency. In this paper, we propose ASPS, an Application and Storage-aware data Placement and job Scheduling approach for Hadoop clusters. The idea is to place application data and schedule application tasks considering both application I/O requirements and storage device characteristics. Specifically, ASPS first introduces novel metrics to quantify the I/O requirements of applications. Then, based on this quantification, ASPS places the data of different applications on the preferred storage devices. Finally, ASPS tries to launch jobs with high I/O requirements on nodes with the same type of faster devices to improve system efficiency. We have implemented ASPS in the Hadoop framework. Experimental results show that ASPS can reduce the completion time of a single application by up to 36% and the average completion time of six concurrent applications by 27%, compared to existing data placement policies and job scheduling approaches.
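As a rough illustration of what application- and storage-aware placement can look like, the heuristic below scores each application's I/O intensity and prefers faster devices for the most I/O-hungry ones. The metric (bytes of I/O per CPU-second), the threshold, and the profiles are assumptions for demonstration, not the metrics introduced by ASPS.

import java.util.List;

// Illustrative placement heuristic in the spirit of application/storage awareness.
// The metric and threshold are assumed; they are not the metrics defined by ASPS.
public class StorageAwarePlacement {

  public record AppProfile(String name, long ioBytes, double cpuSeconds) {}

  // Simple I/O-intensity score: bytes of I/O per CPU-second of work.
  static double ioIntensity(AppProfile app) {
    return app.ioBytes() / app.cpuSeconds();
  }

  // Applications above the threshold get their data placed on SSD-backed nodes.
  static String preferredDevice(AppProfile app, double thresholdBytesPerCpuSec) {
    return ioIntensity(app) >= thresholdBytesPerCpuSec ? "SSD" : "HDD";
  }

  public static void main(String[] args) {
    List<AppProfile> apps = List.of(
        new AppProfile("sort", 64L << 30, 1_200),    // I/O-heavy profile (made up)
        new AppProfile("kmeans", 8L << 30, 9_000));  // compute-heavy profile (made up)
    double threshold = 10e6; // 10 MB per CPU-second, assumed
    for (AppProfile a : apps) {
      System.out.printf("%s -> %s (%.1f MB per CPU-second)%n",
          a.name(), preferredDevice(a, threshold), ioIntensity(a) / 1e6);
    }
  }
}

Task scheduling would then follow the same scores, launching high-intensity jobs on the nodes that hold their data on the faster devices, which is the co-design of placement and scheduling that ASPS targets.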

