Performance-efficient Recommendation and Prediction Service for Big Data frameworks focusing on Data Compression and In-memory Data Storage Indicators

2021 ◽  
Vol 22 (4) ◽  
pp. 401-412
Author(s):  
Hrachya Astsatryan ◽  
Arthur Lalayan ◽  
Aram Kocharyan ◽  
Daniel Hagimont

The MapReduce framework manages Big Data sets by splitting large datasets into distributed blocks and processing them in parallel. Data compression and in-memory file systems are widely used methods in Big Data processing to reduce resource-intensive I/O operations and thereby improve the I/O rate. The article presents a performance-efficient, modular, configurable, and robust decision-making service that relies on data compression and in-memory data storage indicators. The service consists of Recommendation and Prediction modules: it predicts the execution time of a given job based on metrics and recommends the best configuration parameters to improve the performance of the Hadoop and Spark frameworks. Several CPU- and data-intensive applications and micro-benchmarks, including Log Analyzer, WordCount, and K-Means, have been evaluated to assess the performance improvement.
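
The abstract does not disclose the internals of the Recommendation and Prediction modules. As a minimal sketch, assuming the Prediction module is a simple regression over job metrics (input size, compression ratio, in-memory storage fraction) and the Recommendation module picks the configuration with the lowest predicted runtime, it might look like the following; all metric names, data, and parameters here are hypothetical.

    from itertools import product
    from sklearn.linear_model import LinearRegression

    # Hypothetical historical jobs: (input_GB, compression_ratio, in_memory_fraction) -> runtime_s
    X = [[10, 1.0, 0.0], [10, 2.5, 0.0], [10, 2.5, 0.5], [50, 1.0, 0.0], [50, 2.5, 0.5]]
    y = [120, 95, 70, 600, 340]

    # Prediction module: learn execution time from job metrics
    model = LinearRegression().fit(X, y)

    def recommend(input_gb, codec_ratios, mem_fractions):
        """Recommendation module: return the (codec, in-memory) setting with the
        lowest predicted execution time for a job of the given input size."""
        predictions = {
            (c, m): model.predict([[input_gb, c, m]])[0]
            for c, m in product(codec_ratios, mem_fractions)
        }
        return min(predictions, key=predictions.get), predictions

    best, table = recommend(25, codec_ratios=[1.0, 2.5, 4.0], mem_fractions=[0.0, 0.5])
    print("recommended (compression_ratio, in_memory_fraction):", best)

In practice the feature set would include the cluster configuration and workload type, but the flow (collect metrics, predict runtime, rank candidate configurations) stays the same.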

Author(s):  
Ganesh Chandra Deka

NoSQL databases are designed to meet the huge data storage requirements of cloud computing and big data processing. NoSQL databases offer many advanced features in addition to conventional RDBMS features; hence, they are popularly known as “Not only SQL” databases. A variety of NoSQL databases with different features for dealing with exponentially growing data-intensive applications are available as open-source and proprietary options. This chapter discusses some of the popular NoSQL databases and their features in the light of the CAP theorem.


2018 ◽  
Vol 60 (5-6) ◽  
pp. 327-333 ◽  
Author(s):  
René Jäkel ◽  
Eric Peukert ◽  
Wolfgang E. Nagel ◽  
Erhard Rahm

Abstract The efficient and intelligent handling of large, often distributed and heterogeneous data sets increasingly determines scientific and economic competitiveness in most application areas. Mobile applications, social networks, multimedia collections, sensor networks, data-intensive scientific experiments, and complex simulations nowadays generate a huge data deluge. At the same time, processing and analyzing these data sets with innovative methods opens up new opportunities for their exploitation and for new insights. Nevertheless, the resulting resource requirements usually exceed the capabilities of state-of-the-art methods for the acquisition, integration, analysis, and visualization of data; these challenges are summarized under the term big data. ScaDS Dresden/Leipzig, a Germany-wide competence center for collaborative big data research, bundles efforts to realize data-intensive applications for a wide range of uses in science and industry. In this article, we present the basic concept of the competence center and give insights into some of its research topics.


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 246-261
Author(s):  
K.R. Remesh Babu ◽  
K.P. Madhu

The management of big data has become more important due to the widespread adoption of the Internet of Things in various fields. Developments in technology, science, human habits, etc., generate massive amounts of data, so it is increasingly important to store these data and protect them from attacks. Big data analytics is now a hot topic. The data storage facilities provided by cloud computing enable business organizations to overcome the burden of huge data storage and maintenance, and several distributed cloud applications support them in analyzing this data to take appropriate decisions. The dynamic growth of data and data-intensive applications demands an efficient, intelligent storage mechanism for big data. The proposed system analyzes IP packets for vulnerabilities and classifies data nodes as reliable or unreliable for efficient data storage. The proposed Apriori-based method automatically classifies the nodes, yielding an intelligent and secure storage mechanism for distributed big data.
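
The abstract leaves the classification details open. Below is a minimal sketch, assuming each node is described by a set of packet-level vulnerability flags and is labelled unreliable if it contains an itemset that is frequent among known-bad traffic; the flag names, support threshold, and data are hypothetical, and the frequent-itemset pass is a tiny Apriori-style stand-in rather than a full implementation.

    from itertools import combinations

    # Hypothetical vulnerability flags observed in IP traffic per data node
    node_features = {
        "node-1": {"malformed_header", "port_scan"},
        "node-2": {"tls_ok"},
        "node-3": {"malformed_header", "spoofed_ip", "port_scan"},
    }
    known_bad_sessions = [
        {"malformed_header", "port_scan"},
        {"malformed_header", "spoofed_ip"},
        {"malformed_header", "port_scan", "spoofed_ip"},
    ]

    def frequent_itemsets(transactions, min_support=2, max_len=2):
        """Apriori-style pass: count candidate itemsets up to max_len and keep the frequent ones."""
        counts = {}
        for t in transactions:
            for k in range(1, max_len + 1):
                for itemset in combinations(sorted(t), k):
                    counts[itemset] = counts.get(itemset, 0) + 1
        return {s for s, c in counts.items() if c >= min_support}

    bad_patterns = frequent_itemsets(known_bad_sessions)

    def classify(flags):
        # A node is unreliable if any frequent "bad" itemset is fully contained in its flags
        return "unreliable" if any(set(p) <= flags for p in bad_patterns) else "reliable"

    for node, flags in node_features.items():
        print(node, classify(flags))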


MapReduce is a prevalent model for data-intensive applications. It hides the difficulties of parallel programming and provides an abstract execution environment. Hadoop is a benchmark for Big Data storage, providing load balancing, scalability, and fault-tolerant operation. Hadoop performance depends mainly on the scheduler. Various scheduling algorithms [6-10] have been suggested for different types of environments, applications, and workloads. In this work, a new task selection method is developed to assist the scheduler when a node has several local tasks. Experimental results show an improvement of 20% with respect to locality and fairness.
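
The abstract does not spell out the selection criterion, so the sketch below assumes one plausible heuristic: among the pending tasks whose input block is local to the requesting node, pick the task with the fewest alternative replica locations, since it is the least likely to find locality later. The data structures and names are hypothetical, not the paper's actual method.

    # Hypothetical pending tasks: task id -> set of nodes holding a replica of its input block
    pending = {
        "t1": {"nodeA", "nodeB", "nodeC"},
        "t2": {"nodeA"},
        "t3": {"nodeB", "nodeC"},
    }

    def select_task(node, pending_tasks):
        """Pick a task for `node`: prefer local tasks, and among them choose the one
        with the fewest other replica locations (hardest to place locally elsewhere)."""
        local = {tid: locs for tid, locs in pending_tasks.items() if node in locs}
        pool = local if local else pending_tasks   # fall back to a non-local task if none is local
        return min(pool, key=lambda tid: len(pool[tid]))

    print("assign", select_task("nodeA", pending), "to nodeA")   # -> t2, local only to nodeA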


Electronics ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1471
Author(s):  
Jun-Yeong Lee ◽  
Moon-Hyun Kim ◽  
Syed Asif Raza Shah ◽  
Sang-Un Ahn ◽  
Heejun Yoon ◽  
...  

Data are important and ever growing in data-intensive scientific environments. Such research data growth requires data storage systems that play pivotal roles in data management and analysis for scientific discoveries. Redundant Array of Independent Disks (RAID), a well-known storage technology combining multiple disks into a single large logical volume, has been widely used for the purpose of data redundancy and performance improvement. However, this requires RAID-capable hardware or software to build up a RAID-enabled disk array. In addition, it is difficult to scale up the RAID-based storage. In order to mitigate such a problem, many distributed file systems have been developed and are being actively used in various environments, especially in data-intensive computing facilities, where a tremendous amount of data have to be handled. In this study, we investigated and benchmarked various distributed file systems, such as Ceph, GlusterFS, Lustre and EOS for data-intensive environments. In our experiment, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect the read and write performance depending on the features of data, which have to be considered in data-intensive computing environments.
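
The paper's actual benchmark suite is not reproduced here; as a rough illustration of what such read/write measurements look like, the sketch below times sequential writes and reads against a mount point, for example a FUSE-mounted Ceph, GlusterFS, Lustre, or EOS volume. The mount path, file size, and block size are hypothetical.

    import os, time

    MOUNT = "/mnt/dfs-under-test"          # hypothetical FUSE mount of the file system under test
    SIZE_MB, BLOCK = 256, 4 * 1024 * 1024  # total file size and I/O block size

    def sequential_write(path, size_mb, block):
        buf = os.urandom(block)
        start = time.perf_counter()
        with open(path, "wb") as f:
            for _ in range((size_mb * 1024 * 1024) // block):
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())           # ensure the data actually reaches the file system
        return size_mb / (time.perf_counter() - start)      # MB/s

    def sequential_read(path, block):
        size = os.path.getsize(path)
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(block):
                pass
        return (size / (1024 * 1024)) / (time.perf_counter() - start)

    target = os.path.join(MOUNT, "bench.dat")
    print("write MB/s:", sequential_write(target, SIZE_MB, BLOCK))
    print("read  MB/s:", sequential_read(target, BLOCK))

A serious evaluation would also vary file counts, concurrency, and access patterns (as the paper does across RAIN and FUSE configurations), but the basic timing loop stays the same.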


Author(s):  
Abou_el_ela Abdou Hussein

Day by day, advanced web technologies have led to tremendous growth in the volume of data generated daily. This mountain of huge, spread-out data sets leads to the phenomenon called big data: a collection of massive, heterogeneous, unstructured, enormous, and complex data sets. The Big Data life cycle can be represented as collecting (capturing), storing, distributing, manipulating, interpreting, analyzing, investigating, and visualizing big data. Traditional techniques such as Relational Database Management Systems (RDBMS) cannot handle big data because of their inherent limitations, so advancements in computing architecture are required to handle both the data storage requisites and the heavy processing needed to analyze huge volumes and varieties of data economically. There are many technologies for manipulating big data; one of them is Hadoop. Hadoop can be understood as an open-source distributed data processing framework that is one of the prominent and well-known solutions for handling big data. Apache Hadoop is based on the Google File System and the MapReduce programming paradigm. Through this paper we searched for all big data characteristics, starting from the first three V’s, which have been extended over time by researchers to more than fifty-six V’s, and compared the literature to reach the best representation and a precise clarification of all big data V’s characteristics. We highlight the challenges that face big data processing and how to overcome these challenges using Hadoop, and its use in processing big data sets as a solution for resolving various problems in a distributed cloud-based environment. This paper mainly focuses on different components of Hadoop, such as Hive, Pig, and HBase. We also give a thorough description of Hadoop’s pros and cons and of improvements that address Hadoop problems, by choosing a proposed cost-efficient scheduler algorithm for heterogeneous Hadoop systems.
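
To make the MapReduce programming paradigm mentioned above concrete, here is a minimal, self-contained sketch of its map and reduce phases for word counting; a real Hadoop job would ship equivalent map and reduce functions to the cluster (for example via Hadoop Streaming) rather than run them in a single process, and the sample input here is invented.

    from collections import defaultdict

    def map_phase(document):
        # map: emit a (word, 1) pair for every word in the input split
        for word in document.split():
            yield word.lower(), 1

    def reduce_phase(pairs):
        # shuffle + reduce: group pairs by key and sum the counts
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    splits = ["Big data needs Hadoop", "Hadoop processes big data"]
    intermediate = [pair for split in splits for pair in map_phase(split)]
    print(reduce_phase(intermediate))   # e.g. {'big': 2, 'data': 2, 'hadoop': 2, ...}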


2013 ◽  
Vol 3 (1) ◽  
pp. 13-26 ◽  
Author(s):  
Sanjay P. Ahuja ◽  
Sindhu Mani

High Performance Computing (HPC) applications are scientific applications that require significant CPU capabilities. They are also data-intensive applications requiring large data storage. While many researchers have examined the performance of Amazon’s EC2 platform across some HPC benchmarks, an extensive study comparing Amazon’s EC2 and Microsoft’s Windows Azure on metrics such as memory bandwidth, I/O performance, and communication and computational performance is largely missing. The purpose of this paper is to implement existing benchmarks to evaluate and analyze these metrics for EC2 and Windows Azure, spanning both Infrastructure-as-a-Service and Platform-as-a-Service types. This was accomplished by running MPI versions of the STREAM, Interleaved or Random (IOR), and NAS Parallel (NPB) benchmarks on small and medium instance types. In addition, a new EC2 medium instance type (m1.medium) was also included in the analysis. These benchmarks measure memory bandwidth, I/O performance, and communication and computational performance.
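
The STREAM, IOR, and NPB suites themselves are not reproduced here; as an illustration of the kind of point-to-point communication measurement such MPI benchmarks perform, the sketch below uses mpi4py to time a ping-pong exchange between two ranks (run with something like mpirun -np 2 python pingpong.py). The message size and repetition count are arbitrary choices, not values from the paper.

    import time
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    SIZE = 8 * 1024 * 1024          # 8 MB message buffer
    REPS = 50
    buf = np.zeros(SIZE, dtype="b")

    comm.Barrier()
    start = time.perf_counter()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1)      # rank 0 sends, rank 1 echoes the message back
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = time.perf_counter() - start

    if rank == 0:
        mb_moved = 2 * REPS * SIZE / (1024 * 1024)
        print(f"point-to-point bandwidth: {mb_moved / elapsed:.1f} MB/s")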

