Hadoop Clusters
Recently Published Documents


TOTAL DOCUMENTS: 134 (FIVE YEARS 35)

H-INDEX: 15 (FIVE YEARS 2)

2021
Author(s): Nathalie Lehmann, Sandrine Perrin, Claire Wallon, Xavier Bauquet, Vivien Deshaies, ...

Motivation: Core sequencing facilities produce huge amounts of sequencing data that need to be analysed with automated workflows to ensure reproducibility and traceability. Eoulsan is a versatile open-source workflow engine meeting the needs of core facilities by automating the analysis of a large number of samples. Its core design separates the description of the workflow from the actual commands to be run. This design simplifies usage, as the user does not need to write code, while ensuring reproducibility. Eoulsan was initially developed for bulk RNA-seq data, but transcriptomics applications have recently widened with the advent of long-read sequencing and single-cell technologies, calling for the development of new workflows. Result: We present Eoulsan 2, a major update that (i) enhances the workflow manager itself, (ii) facilitates the development of new modules, and (iii) expands its applications to long-read RNA-seq (Oxford Nanopore Technologies) and scRNA-seq (Smart-seq2 and 10x Genomics). The workflow manager has been rewritten, with support for execution on a larger choice of computational infrastructure (workstations, Hadoop clusters, and various job schedulers for cluster usage). Eoulsan now facilitates the development of new modules by reusing wrappers developed for the Galaxy platform, with support for container images (Docker or Singularity) that package the tools to be executed. Finally, Eoulsan natively integrates novel modules for bulk RNA-seq, as well as others specifically designed for processing long-read RNA-seq and scRNA-seq. Eoulsan 2 is distributed with ready-to-use workflows and companion tutorials. Availability and implementation: Eoulsan is implemented in Java, supported on Linux systems, and distributed under the LGPL and CeCILL-C licenses at: http://outils.genomique.biologie.ens.fr/eoulsan/. The source code and sample workflows are available on GitHub: https://github.com/GenomicParisCentre/eoulsan. A GitHub repository for modules using the Galaxy tool XML syntax is further provided at: https://github.com/GenomicParisCentre/eoulsan-tools
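The design point above, a workflow described as data rather than as code the user writes, can be illustrated with a minimal sketch. This is a hypothetical illustration in plain Java, not Eoulsan's actual API; the module names and parameters are made up.

import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch of a declarative workflow core: steps are plain data
// naming a module and its parameters, and the engine resolves and runs the
// modules. This is not Eoulsan's actual API.
public class MiniWorkflowEngine {

  // A step only describes what to run; it contains no executable code.
  public record Step(String module, Map<String, String> params) {}

  private final Map<String, BiConsumer<Map<String, String>, String>> modules;

  public MiniWorkflowEngine(Map<String, BiConsumer<Map<String, String>, String>> modules) {
    this.modules = modules;
  }

  // The engine, not the user, turns the description into actual executions.
  public void run(List<Step> workflow, String sample) {
    for (Step step : workflow) {
      modules.get(step.module()).accept(step.params(), sample);
    }
  }

  public static void main(String[] args) {
    // Illustrative module registry and workflow description (made-up names).
    Map<String, BiConsumer<Map<String, String>, String>> registry = Map.of(
        "filterreads", (params, sample) -> System.out.println("filtering " + sample + " with " + params),
        "mapreads", (params, sample) -> System.out.println("mapping " + sample + " with " + params));
    new MiniWorkflowEngine(registry).run(
        List.of(new Step("filterreads", Map.of("trim.length", "11")),
                new Step("mapreads", Map.of("mapper", "star"))),
        "sample_1.fastq");
  }
}

Because the workflow is data, the same description can be handed to a workstation, a Hadoop cluster, or a job scheduler without the user changing it, which is the portability property the abstract highlights.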


2021
Vol 10 (4), pp. 1-25
Author(s): Sundarakumar M. R., Mahadevan G., Ramasubbareddy Somula, Sankar Sennan, Bharat S. Rawal

Big Data analytics is an approach for extracting information from very large data warehouse systems. It relies on MapReduce and HDFS to compress and cluster high volumes of data. However, data processing in Hadoop clusters takes considerable time for extraction and storage. The proposed system addresses the time delay in the shuffle phase of MapReduce caused by scheduling and sequencing. To improve processing speed, this work uses a Compressed Elastic Search Index (CESI) and a MapReduce-Based Next Generation Sequencing Approach (MRBNGSA). This approach increases the speed of data retrieval from HDFS clusters because only the metadata is stored in HDFS, which requires less memory at runtime than storing the full data. It thereby reduces the CPU utilization and memory allocation of the resource manager in the Hadoop framework and improves data processing speed, so that the overall time delay is reduced and latency is minimized.
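The abstract gives no implementation details, so the following is only a rough sketch of the metadata-only storage idea it describes: an index maps each record identifier to the location of the full record, so a lookup touches a few bytes of metadata and then issues one ranged read instead of scanning the bulk data. The class and field names are assumptions for illustration, not the authors' CESI/MRBNGSA code.

import java.util.HashMap;
import java.util.Map;

// Illustrative only: a tiny metadata index in the spirit of "store only the
// metadata in HDFS, keep the bulk data elsewhere". Not the authors' implementation.
public class MetadataIndex {

  // Location of one record inside a large data file (e.g. a file on HDFS).
  public record Location(String file, long offset, int length) {}

  private final Map<String, Location> index = new HashMap<>();

  public void put(String recordId, Location where) {
    index.put(recordId, where);
  }

  // Returns only the small metadata entry; the caller then reads exactly
  // `length` bytes at `offset` from `file` instead of scanning the dataset.
  public Location locate(String recordId) {
    return index.get(recordId);
  }

  public static void main(String[] args) {
    MetadataIndex idx = new MetadataIndex();
    idx.put("read_000042", new Location("/data/part-00001", 1_048_576L, 512));
    System.out.println(idx.locate("read_000042"));
  }
}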



2021
Vol 8 (1)
Author(s): N. Ahmed, Andre L. C. Barczak, Mohammad A. Rashid, Teo Susnjak

Abstract: This article proposes a new parallel performance model for different workloads of Spark Big Data applications running on Hadoop clusters. The proposed model can predict the runtime for generic workloads as a function of the number of executors, without necessarily knowing how the algorithms were implemented. For a given problem size, it is shown that a model based on serial boundaries for a 2D arrangement of executors can fit the empirical data for various workloads. The empirical data was obtained from a real Hadoop cluster, using Spark and HiBench. The workloads used in this work included WordCount, SVM, Kmeans, PageRank and Graph (Nweight). A particular runtime pattern emerged when adding more executors to run a job: for some workloads, the runtime was longer with more executors added. This phenomenon is predicted by the new model of parallelisation. The resulting equation from the model explains certain performance patterns that fit neither Amdahl's law nor Gustafson's law. The results show that the proposed model achieved the best fit with all workloads and most of the data sizes, using the R-squared metric to assess the goodness of fit to the empirical data. The proposed model has advantages over machine learning models due to its simplicity, requiring a smaller number of experiments to fit the data. This is very useful to practitioners in the area of Big Data because they can predict the runtime of specific applications by analysing the logs. In this work, the model is limited to changes in the number of executors for a fixed problem size.
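The abstract does not reproduce the fitted equation, but its qualitative claim, that runtime can rise again as executors are added, which neither Amdahl's law nor Gustafson's law predicts, can be demonstrated with assumed functional forms. The constants and the c * n overhead term below are illustrative assumptions, not the paper's model.

// Illustrative comparison (assumed forms and constants, not the paper's model):
// an Amdahl-style runtime keeps decreasing with more executors, while adding a
// per-executor coordination cost produces the observed upturn.
public class RuntimeModels {

  // Amdahl-style scaling: T(n) = serial + parallel / n
  static double amdahl(double serial, double parallel, int n) {
    return serial + parallel / n;
  }

  // Overhead-augmented model: T(n) = serial + parallel / n + c * n
  // (the c * n term stands in for per-executor coordination cost)
  static double withOverhead(double serial, double parallel, double c, int n) {
    return serial + parallel / n + c * n;
  }

  public static void main(String[] args) {
    double serial = 20, parallel = 600, c = 1.5; // made-up values in seconds
    for (int n : new int[] {1, 2, 4, 8, 16, 32, 64}) {
      System.out.printf("executors=%2d  amdahl=%6.1f s  withOverhead=%6.1f s%n",
          n, amdahl(serial, parallel, n), withOverhead(serial, parallel, c, n));
    }
  }
}

With these constants the overhead-augmented runtime is lowest around 20 executors and grows again beyond that, the same shape as the "longer runtime with more executors" pattern reported above, whereas the Amdahl-style curve only flattens.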


2021
Vol 9 (2), pp. 589-604
Author(s): Pruthvi Raj Venkatesh, et al.

Oil industries generate an enormous volume of digitized data (e.g., seismic data) as part of their seismic studies and move it to the cloud for downstream applications. Moving massive data into the cloud can pose many challenges, especially for commercial off-the-shelf geoscience applications, as they require very high compute and disk throughput. This paper proposes a digital transformation framework for efficient seismic data processing and storage comprising: (a) novel data storage options, (b) a cloud-based HPC framework for efficient seismic data processing, and (c) MD5 hash calculation using the MapReduce pattern with Hadoop clusters. The Azure cloud platform is used to validate the proposed framework and compare it with the existing process. Experimental results show a significant improvement in execution time, throughput, efficiency, and cost. The proposed framework can be used in any domain that deals with extensive data requiring high compute and throughput.
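The MD5-with-MapReduce step is not detailed in the abstract, so the mapper below is only a plausible sketch: it assumes an input format that hands each file to a map task as a single (filename, contents) record (Hadoop does not ship such a format; a custom whole-file InputFormat would be needed) and emits one hex digest per file. The driver and job wiring are omitted.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch, not the paper's code: emits one MD5 digest per input file.
public class Md5Mapper extends Mapper<Text, BytesWritable, Text, Text> {

  @Override
  protected void map(Text fileName, BytesWritable contents, Context ctx)
      throws java.io.IOException, InterruptedException {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      // getBytes() may return a padded backing array, so hash only getLength() bytes.
      md5.update(contents.getBytes(), 0, contents.getLength());
      StringBuilder hex = new StringBuilder();
      for (byte b : md5.digest()) {
        hex.append(String.format("%02x", b));
      }
      ctx.write(fileName, new Text(hex.toString()));
    } catch (NoSuchAlgorithmException e) {
      throw new java.io.IOException(e);
    }
  }
}

Hashing whole files inside each map task keeps the digests exact: MD5 is a sequential hash, so per-chunk digests computed in parallel could not simply be combined into the digest of the original file.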


2020
Vol 29 (16), pp. 2050254
Author(s): Tao Li, Shuibing He, Ping Chen, Siling Yang, Yanlong Yin, ...

As one of the most popular frameworks for large-scale analytics processing, Hadoop faces two challenges: both applications and storage devices are becoming heterogeneous. However, existing data placement and job scheduling schemes pay little attention to this heterogeneity in either application I/O requirements or I/O device capability, and can therefore greatly degrade system efficiency. In this paper, we propose ASPS, an Application and Storage-aware data Placement and job Scheduling approach for Hadoop clusters. The idea is to place application data and schedule application tasks considering both application I/O requirements and storage device characteristics. Specifically, ASPS first introduces novel metrics to quantify the I/O requirements of applications. Then, based on this quantification, ASPS places the data of different applications on the preferred storage devices. Finally, ASPS tries to launch jobs with high I/O requirements on nodes with the same type of faster devices to improve system efficiency. We have implemented ASPS in the Hadoop framework. Experimental results show that ASPS can reduce the completion time of a single application by up to 36% and the average completion time of six concurrent applications by 27%, compared to existing data placement policies and job scheduling approaches.
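As a rough illustration of what application- and storage-aware placement can look like, the heuristic below scores each application's I/O intensity and prefers faster devices for the most I/O-hungry ones. The metric (bytes of I/O per CPU-second), the threshold, and the profiles are assumptions for demonstration, not the metrics introduced by ASPS.

import java.util.List;

// Illustrative placement heuristic in the spirit of application/storage awareness.
// The metric and threshold are assumed; they are not the metrics defined by ASPS.
public class StorageAwarePlacement {

  public record AppProfile(String name, long ioBytes, double cpuSeconds) {}

  // Simple I/O-intensity score: bytes of I/O per CPU-second of work.
  static double ioIntensity(AppProfile app) {
    return app.ioBytes() / app.cpuSeconds();
  }

  // Applications above the threshold get their data placed on SSD-backed nodes.
  static String preferredDevice(AppProfile app, double thresholdBytesPerCpuSec) {
    return ioIntensity(app) >= thresholdBytesPerCpuSec ? "SSD" : "HDD";
  }

  public static void main(String[] args) {
    List<AppProfile> apps = List.of(
        new AppProfile("sort", 64L << 30, 1_200),    // I/O-heavy profile (made up)
        new AppProfile("kmeans", 8L << 30, 9_000));  // compute-heavy profile (made up)
    double threshold = 10e6; // 10 MB per CPU-second, assumed
    for (AppProfile a : apps) {
      System.out.printf("%s -> %s (%.1f MB per CPU-second)%n",
          a.name(), preferredDevice(a, threshold), ioIntensity(a) / 1e6);
    }
  }
}

Task scheduling would then follow the same scores, launching high-intensity jobs on the nodes that hold their data on the faster devices, which is the co-design of placement and scheduling that ASPS targets.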

