Study on Big Data Frameworks

Author(s):  
Adriano Fernandes ◽  
Jonathan Barretto ◽  
Jonas Fernandes

Big data analytics is growing steadily in popularity as a tool for evaluating large volumes of data on demand. Apache Hadoop, Spark, Storm, and Flink are four of the most widely used big data processing frameworks. Although all four support big data analysis, they differ in how they are used and in the infrastructure that supports them. This paper defines a general set of key performance indicators (KPIs), namely processing time, CPU use, latency, execution time, performance, scalability, and fault tolerance, and contrasts the four frameworks against these KPIs in a literature review. For non-real-time processing, Spark outperformed Apache Hadoop and Apache Storm on multiple KPIs, including processing time, CPU usage, latency, execution time, and scalability. Flink surpassed Apache Spark and Apache Storm in processing time, CPU consumption, latency, execution time, and performance.

Nowadays, large volumes of data are generated in the form of text, voice, video, images, and sound. Handling and processing these different types of data is very challenging, and analyzing big data with traditional data processing applications is laborious. Because the data is scattered across huge file systems, big data analysis is a difficult task, so a number of tools and techniques are required. Data mining techniques such as clustering, prediction, classification, and decision trees are used to analyze big data. Apache Hadoop, Apache Spark, Apache Storm, MongoDB, NoSQL, and HPCC are among the tools used to handle big data. This paper presents a review and comparative study of the tools and techniques commonly used for big data analytics, along with a brief summary of each.


2021 ◽  
Vol 9 (1) ◽  
pp. 16-44
Author(s):  
Weiqing Zhuang ◽  
Morgan C. Wang ◽  
Ichiro Nakamoto ◽  
Ming Jiang

Abstract: Big data analytics (BDA) in e-commerce, an emerging field that started in 2006, deeply affects the development of global e-commerce, especially its layout and performance in the U.S. and China. This paper examines the relative influence of theoretical research on BDA in e-commerce to explain the differences between the U.S. and China, applying statistical analysis to samples collected for the two countries from two main literature databases, Web of Science and CNKI. The results help clarify doubts regarding the development of China’s e-commerce, which today exceeds that of the U.S., in view of the theoretical comparison of BDA in e-commerce between the two countries.


Author(s):  
Mohd Imran ◽  
Mohd Vasim Ahamad ◽  
Misbahul Haque ◽  
Mohd Shoaib

The term big data analytics refers to mining and analyzing voluminous amounts of data using various tools and platforms. Popular tools include Apache Hadoop, Apache Spark, HBase, Storm, GridGain, HPCC, Cassandra, Pig, Hive, and NoSQL. Which tool is used depends on the parameters considered for the big data analysis, so a comparative analysis of these tools is needed to choose the best and simplest approach for achieving optimal throughput and efficient mining. This chapter contributes a comparative study of big data analytics tools based on aspects such as their functionality, pros, and cons, characteristics that can be used to determine the best and most efficient among them. The comparative study helps practitioners use these tools more efficiently.


2022 ◽  
pp. 622-631
Author(s):  
Mohd Imran ◽  
Mohd Vasim Ahamad ◽  
Misbahul Haque ◽  
Mohd Shoaib

The term big data analytics refers to mining and analyzing voluminous amounts of data using various tools and platforms. Popular tools include Apache Hadoop, Apache Spark, HBase, Storm, GridGain, HPCC, Cassandra, Pig, Hive, and NoSQL. Which tool is used depends on the parameters considered for the big data analysis, so a comparative analysis of these tools is needed to choose the best and simplest approach for achieving optimal throughput and efficient mining. This chapter contributes a comparative study of big data analytics tools based on aspects such as their functionality, pros, and cons, characteristics that can be used to determine the best and most efficient among them. The comparative study helps practitioners use these tools more efficiently.


2018 ◽  
Vol 7 (4.5) ◽  
pp. 485
Author(s):  
Samson Fadiya ◽  
Arif Sari

The adoption of Web 2.0 technologies, the Internet of Things, and similar technologies by individuals and organizations has led to an explosion of data. As it stands, existing Relational Database Management Systems (RDBMSs) are incapable of handling this deluge of data. The term Big Data was coined to represent these vast, fast, and complex datasets that regular RDBMSs could not handle. Special tools and frameworks were developed to process, manage, and store this big data. These tools can function in distributed, industry-standard environments, thereby maintaining efficiency and effectiveness at a business level. Apache Hadoop is an example of such a framework. This report discusses big data, its origins, the opportunities and challenges it presents, big data analytics, and the application of big data using existing tools and frameworks. It also discusses Apache Hadoop as a big data framework and provides a basic overview of this technology from technological and business perspectives.


2020 ◽  
Vol 34 (28) ◽  
pp. 2050311
Author(s):  
Satvik Vats ◽  
B. B. Sagar

In the big data domain, platform dependency can alter the behavior of the business because of the different kinds (structured, semi-structured, and unstructured) and characteristics of the data. With traditional infrastructure, different kinds of data cannot be processed simultaneously, because each task depends on a particular platform. The responsibility of selecting suitable tools therefore lies with the user, yet the variety of data generated by different sources calls for selecting suitable tools without human intervention. These tools also face limited resources when dealing with large volumes of data, which affects their performance in terms of execution time. In this work, we propose a model in which different data analytics tools share a common infrastructure to provide data independence and a resource-sharing environment: the proposed model shares a common (hybrid) Hadoop Distributed File System (HDFS) between three NameNodes (master nodes), three DataNodes, and one client node, all working inside a demilitarized zone (DMZ). To realize this model, we implemented Mahout, R-Hadoop, and Splunk sharing a common HDFS. Using our model, we ran k-means clustering, Naïve Bayes, and recommender algorithms on three different datasets (movie rating, newsgroup, and spam SMS), representing structured, semi-structured, and unstructured data, respectively. Our model selected the appropriate tool in each case, e.g. Mahout for the newsgroup dataset, as the other tools cannot run on this data. This shows that our model provides data independence. The results of the proposed model are further compared with the legacy (individual) model in terms of execution time and scalability. The improved performance of the proposed model supports the hypothesis that it overcomes the resource limitations of the legacy model.


Author(s):  
Anirban Mukherjee ◽  
Joydip Datta ◽  
Raghavendra Jorapur ◽  
Ravi Singhvi ◽  
Saurav Haloi ◽  
...  

2019 ◽  
Vol 57 (8) ◽  
pp. 1993-2009
Author(s):  
Lorenzo Ardito ◽  
Veronica Scuotto ◽  
Manlio Del Giudice ◽  
Antonio Messeni Petruzzelli

Purpose: The purpose of this paper is to scrutinize and classify the literature linking Big Data analytics and management phenomena. Design/methodology/approach: An objective bibliometric analysis is conducted, supported by subjective assessments based on the studies focused on the intertwining of Big Data analytics and management fields. Specifically, deeper descriptive statistics and document co-citation analysis are provided. Findings: From the document co-citation analysis and its evaluation, four clusters depicting literature linking Big Data analytics and management phenomena are revealed: theoretical development of Big Data analytics; management transition to Big Data analytics; Big Data analytics and firm resources, capabilities and performance; and Big Data analytics for supply chain management. Originality/value: To the best of the authors’ knowledge, this is one of the first attempts to comprehend the research streams which, over time, have paved the way to the intersection between Big Data analytics and management fields.

