Big Data Query Optimization -Literature Survey

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets, while HDFS is a distributed file system that provides high-throughput access to data-driven applications, and MapReduce is software framework for distributed computing of large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular techniques of data mining is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique prove its efficiency; thus, HDFS and MapReduce can apply to big data with very promising results.

Download Full-text

Exploiting soft and hard correlations in big data query optimization

Proceedings of the VLDB Endowment ◽

10.14778/2994509.2994519 ◽

2016 ◽

Vol 9 (12) ◽

pp. 1005-1016 ◽

Cited By ~ 7

Author(s):

Hai Liu ◽

Dongqing Xiao ◽

Pankaj Didwania ◽

Mohamed Y. Eltabakh

Keyword(s):

Big Data ◽

Query Optimization ◽

Data Query

Download Full-text

A Big Data Query Optimization Framework for Telecom Customer Churn Analysis

10.1007/978-981-16-2597-8_40 ◽

2021 ◽

pp. 475-484

Author(s):

Aarti Chugh ◽

Vivek Kumar Sharma ◽

Manjot Kaur Bhatia ◽

Charu Jain

Keyword(s):

Big Data ◽

Query Optimization ◽

Customer Churn ◽

Optimization Framework ◽

Data Query ◽

Churn Analysis

Download Full-text

A Review on Latest Technologies in Big Data Analysis

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i3.1.16806 ◽

2018 ◽

Vol 7 (3.1) ◽

pp. 93

Author(s):

Castro S ◽

Pushpalakshmi R

Keyword(s):

Big Data ◽

Digital Technologies ◽

Large Data ◽

Big Data Analysis ◽

Digital World ◽

Huge Data ◽

Big Data Applications ◽

Research Problems ◽

Modern Information ◽

Multiple Stages

In this digital world, the modern information systems have produced a large amount of data which needs huge depositary in terms of terabytes for storage. Some of the digital technologies such as cloud computing and Internet of Things (IoT) are considered as the major sources of such large data. It is necessary to extract knowledge by analyzing these huge data which needs several attempts at multiple stages for decision making. Thus, the recent researches have focused on the analysis of big data. The main aim of this paper is to investigate the challenges of big data, applications, opportunities, implantation tools and its research problems. Thus, this study presents a platform to investigate big data at various levels. Moreover, it initiates a novel perspective for researchers to provide the solutions according to the challenges and research problems.

Download Full-text

A Benchmark for Suitability of Alluxio over Spark

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a8190.1110120 ◽

2020 ◽

Vol 10 (1) ◽

pp. 245-250

Keyword(s):

Big Data ◽

Data Processing ◽

Data Storage ◽

Storage Systems ◽

Distributed Storage ◽

Storage System ◽

Large Data ◽

Time Data ◽

Big Data Applications ◽

Access To Data

Big data applications play an important role in real time data processing. Apache Spark is a data processing framework with in-memory data engine that quickly processes large data sets. It can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. Spark’s in-memory processing cannot share data between the applications and hence, the RAM memory will be insufficient for storing petabytes of data. Alluxio is a virtual distributed storage system that leverages memory for data storage and provides faster access to data in different storage systems. Alluxio helps to speed up data intensive Spark applications, with various storage systems. In this work, the performance of applications on Spark as well as Spark running over Alluxio have been studied with respect to several storage formats such as Parquet, ORC, CSV, and JSON; and four types of queries from Star Schema Benchmark (SSB). A benchmark is evolved to suggest the suitability of Spark Alluxio combination for big data applications. It is found that Alluxio is suitable for applications that use databases of size more than 2.6 GB storing data in JSON and CSV formats. Spark is found suitable for applications that use storage formats such as parquet and ORC with database sizes less than 2.6GB.

Download Full-text

Research on Big Data Query Optimization Method of Power System Substation Equipment Condition Monitoring

10.1109/icpsasia52756.2021.9621504 ◽

2021 ◽

Author(s):

Lixia Wang ◽

Dawei Wang ◽

Wei Li

Keyword(s):

Big Data ◽

Power System ◽

Condition Monitoring ◽

Query Optimization ◽

Optimization Method ◽

Data Query ◽

Substation Equipment

Download Full-text

Intelligent Management and Efficient Operation of Big Data

Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-8505-5.ch006 ◽

2015 ◽

pp. 105-129 ◽

Cited By ~ 1

Author(s):

José Moura ◽

Fernando Batista ◽

Elsa Cardoso ◽

Luís Nunes

Keyword(s):

Big Data ◽

Large Data ◽

Data Sources ◽

Data Traffic ◽

Efficient Operation ◽

Big Data Applications ◽

Intelligent Management ◽

High Level ◽

Cloud Infrastructures

This chapter details how Big Data can be used and implemented in networking and computing infrastructures. Specifically, it addresses three main aspects: the timely extraction of relevant knowledge from heterogeneous, and very often unstructured large data sources; the enhancement on the performance of processing and networking (cloud) infrastructures that are the most important foundational pillars of Big Data applications or services; and novel ways to efficiently manage network infrastructures with high-level composed policies for supporting the transmission of large amounts of data with distinct requisites (video vs. non-video). A case study involving an intelligent management solution to route data traffic with diverse requirements in a wide area Internet Exchange Point is presented, discussed in the context of Big Data, and evaluated.

Download Full-text

Adaptive correlation exploitation in big data query optimization

The VLDB Journal ◽

10.1007/s00778-018-0515-8 ◽

2018 ◽

Vol 27 (6) ◽

pp. 873-898 ◽

Cited By ~ 2

Author(s):

Yuchen Liu ◽

Hai Liu ◽

Dongqing Xiao ◽

Mohamed Y. Eltabakh

Keyword(s):

Big Data ◽

Query Optimization ◽

Data Query

Download Full-text

Intelligent Management and Efficient Operation of Big Data

Web Services ◽

10.4018/978-1-5225-7501-6.ch102 ◽

2019 ◽

pp. 1991-2016

Author(s):

José Moura ◽

Fernando Batista ◽

Elsa Cardoso ◽

Luís Nunes

Keyword(s):

Big Data ◽

Large Data ◽

Data Sources ◽

Data Traffic ◽

Efficient Operation ◽

Big Data Applications ◽

Intelligent Management ◽

High Level ◽

Cloud Infrastructures

This chapter details how Big Data can be used and implemented in networking and computing infrastructures. Specifically, it addresses three main aspects: the timely extraction of relevant knowledge from heterogeneous, and very often unstructured large data sources; the enhancement on the performance of processing and networking (cloud) infrastructures that are the most important foundational pillars of Big Data applications or services; and novel ways to efficiently manage network infrastructures with high-level composed policies for supporting the transmission of large amounts of data with distinct requisites (video vs. non-video). A case study involving an intelligent management solution to route data traffic with diverse requirements in a wide area Internet Exchange Point is presented, discussed in the context of Big Data, and evaluated.

Download Full-text

Applying the K-Means Algorithm in Big Raw Data Sets with Hadoop and MapReduce

Big Data Management, Technologies, and Applications - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-4699-5.ch002 ◽

2013 ◽

pp. 23-46

Author(s):

Ilias K. Savvas ◽

Georgia N. Sofianidou ◽

M-Tahar Kechadi

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

File System ◽

Large Data ◽

Large Data Sets ◽

Distributed File System ◽

Data Sets ◽

Raw Data ◽

Hadoop Distributed File System ◽

Access To Data

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets, while HDFS is a distributed file system that provides high-throughput access to data-driven applications, and MapReduce is software framework for distributed computing of large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular techniques of data mining is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique prove its efficiency; thus, HDFS and MapReduce can apply to big data with very promising results.

Download Full-text