An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering

2018 ◽  
Vol 9 (3) ◽  
pp. 15-30 ◽  
Author(s):  
S. Vengadeswaran ◽  
S. R. Balasundaram

This article describes how the time taken to execute a query and return the results increases exponentially as the data size increases, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, is considered an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates data blocks randomly across the cluster of nodes without considering any execution parameters. As a result, the blocks required for execution are often not available on the local machine, so the data has to be transferred across the network, leading to a data locality issue. It is also commonly observed that most data-intensive applications show grouping semantics, so during query execution only a part of the Big Data set is utilized. Since such execution parameters and grouping behavior are not considered, the default placement performs poorly, resulting in several lacunas such as decreased local map task execution, increased query execution time, and query latency. To overcome these issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. Initially, the user history log is dynamically analyzed to identify the access pattern, which is depicted as a graph. Markov clustering, a graph clustering algorithm, is applied to identify groupings within the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on statistical measures estimated from the clustered graph. This in turn reorganizes the default data layouts in HDFS to achieve improved performance for Big Data sets in a heterogeneous distributed environment. The proposed strategy was tested on a 15-node cluster arranged in a single-rack topology. The results proved more efficient for massive datasets, reducing query execution time by 26% and improving data locality by 38% compared with HDDPS.
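A minimal sketch of the Markov clustering (MCL) step described above, assuming a toy co-access graph built from the history log (edge weights count how often two HDFS blocks are accessed by the same query); this is an illustrative re-implementation, not the authors' code:

```python
import numpy as np

def markov_cluster(adjacency, expansion=2, inflation=2.0, iterations=50, tol=1e-6):
    """Minimal MCL: alternate expansion (matrix power) and inflation
    (elementwise power + column normalization) until the matrix stabilizes."""
    M = adjacency.astype(float) + np.eye(adjacency.shape[0])  # add self-loops
    M /= M.sum(axis=0, keepdims=True)                         # column-stochastic
    for _ in range(iterations):
        prev = M.copy()
        M = np.linalg.matrix_power(M, expansion)              # expansion
        M = M ** inflation                                    # inflation
        M /= M.sum(axis=0, keepdims=True)
        if np.allclose(M, prev, atol=tol):
            break
    # Each surviving row's non-zero columns form one cluster of blocks.
    clusters = []
    for row in M:
        members = set(np.flatnonzero(row > 1e-8).tolist())
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# Hypothetical co-access graph of 5 blocks (values are made up for illustration).
A = np.array([[0, 3, 2, 0, 0],
              [3, 0, 4, 0, 0],
              [2, 4, 0, 0, 0],
              [0, 0, 0, 0, 5],
              [0, 0, 0, 5, 0]])
print(markov_cluster(A))  # e.g. [{0, 1, 2}, {3, 4}] -> two placement groups
```

Blocks that land in the same cluster would then be co-located by ODPA so that queries touching that group run node-locally.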

2018 ◽  
Vol 19 (3) ◽  
pp. 245-258
Author(s):  
Vengadeswaran Shanmugasundaram ◽  
Balasundaram Sadhu Ramakrishnan

In this data era, massive volumes of data are generated every second in a variety of domains such as geoscience, the social web, finance, e-commerce, health care, climate modelling, physics, astronomy, and government sectors. Hadoop is well recognized as the de facto big data processing platform and has been extensively adopted across many application domains processing Big Data. Even though it is considered an efficient solution for such complex query processing, it has its own limitations when the data to be processed exhibit interest locality. The data required for any query execution follow grouping behavior, wherein only a part of the Big Data is accessed frequently. In such scenarios, the time taken to execute a query and return the results increases exponentially as the amount of data increases, leading to long waiting times for the user. Since the Hadoop default data placement strategy (HDDPS) does not consider such grouping behavior, it does not perform efficiently, resulting in lacunas such as decreased local map task execution and increased query execution time. Hence, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. In this paper we examine the significance of two of the most promising clustering techniques, viz. Hierarchical Agglomerative Clustering (HAC) and Markov Clustering (MCL), in grouping-aware data placement for data-intensive applications having interest locality. Initially, the user access pattern is identified by dynamically analyzing the history log. Then both clustering techniques (HAC and MCL) are applied separately over the access pattern to obtain independent clusters. These clusters are interpreted and validated to extract the Optimal Data Groupings (ODG). Finally, the proposed strategy reorganizes the default data layouts in HDFS based on the ODG to achieve maximum parallel execution per group, subject to the Load Balancer and Rack Awareness. Our proposed strategy is tested on a 10-node cluster placed in a multi-rack setup, with Hadoop installed on every node, deployed on a cloud platform. The proposed strategy reduces query execution time, significantly improves data locality, and proves more efficient for massive dataset processing in a heterogeneous distributed environment. MCL also shows marginally better performance than HAC for queries exhibiting interest localities.
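For the HAC side of the comparison, a minimal sketch using SciPy's agglomerative clustering over the same kind of toy co-access matrix (all values illustrative; the co-access counts and the similarity-to-distance conversion are assumptions, not the paper's exact procedure):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical block co-access counts from the history log
# (higher count = blocks accessed together more often).
cooccur = np.array([[0, 3, 2, 0, 0],
                    [3, 0, 4, 0, 0],
                    [2, 4, 0, 0, 0],
                    [0, 0, 0, 0, 5],
                    [0, 0, 0, 5, 0]], dtype=float)

# Convert similarity to distance so frequently co-accessed blocks are "close".
dist = 1.0 / (1.0 + cooccur)
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist), method="average")  # agglomerative, average linkage
labels = fcluster(Z, t=2, criterion="maxclust")  # cut dendrogram into 2 groups
print(labels)  # e.g. [1 1 1 2 2] -> groupings {0,1,2} and {3,4}
```

Running HAC and MCL over the same access pattern and comparing the resulting groupings mirrors the paper's evaluation of which technique yields better ODG.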


Energies ◽  
2020 ◽  
Vol 13 (17) ◽  
pp. 4508
Author(s):  
Xin Li ◽  
Liangyuan Wang ◽  
Jemal H. Abawajy ◽  
Xiaolin Qin ◽  
Giovanni Pau ◽  
...  

Efficient big data analysis is critical to support applications and services in Internet of Things (IoT) systems, especially time-intensive services. Hence, a data center may host heterogeneous big data analysis tasks for multiple IoT systems. This is a challenging problem, since data centers usually need to schedule a large number of periodic or online tasks in a short time. In this paper, we investigate the heterogeneous task scheduling problem with the aim of reducing global task execution time, which is also an effective way to reduce energy consumption in data centers. We model task execution for the heterogeneous tasks based on the data locality feature, which also captures the relationships among tasks, data blocks, and servers. We propose a heterogeneous task scheduling algorithm with data migration. The core idea of the algorithm is to maximize efficiency by comparing the cost of remote task execution against that of data migration, which improves data locality and reduces task execution time. We conduct extensive simulations, and the experimental results show that our algorithm performs better than traditional methods and that data migration does reduce the overall task execution time. The algorithm also shows acceptable fairness for heterogeneous tasks.
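A toy version of the core cost comparison, under a deliberately simple model I am assuming here (remote execution streams the block once per task, migration pays the transfer once and makes all later tasks on that block node-local); the paper's actual cost model is richer:

```python
def schedule_task(block_mb: float, remote_bw_mbps: float,
                  migrate_bw_mbps: float, expected_reuses: int) -> str:
    """Pick remote execution or data migration by comparing transfer costs.

    remote_cost: every future task on this block pays the network read.
    migrate_cost: the transfer is paid once, then all tasks run locally.
    """
    remote_cost = expected_reuses * (block_mb / remote_bw_mbps)
    migrate_cost = block_mb / migrate_bw_mbps
    return "migrate data" if migrate_cost < remote_cost else "run remotely"

# A block expected to be reused by 3 tasks: migration amortizes the transfer.
print(schedule_task(block_mb=128, remote_bw_mbps=100,
                    migrate_bw_mbps=100, expected_reuses=3))  # -> migrate data
```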


Author(s):  
Joaquín Pérez Ortega ◽  
Nelva Nely Almanza Ortega ◽  
Andrea Vega Villalobos ◽  
Marco A. Aguirre L. ◽  
Crispín Zavala Díaz ◽  
...  

In recent years, the amount of natural-language text in digital format has increased impressively. To obtain useful information from such large volumes of data, new specialized techniques and efficient algorithms are required. Text mining consists of extracting meaningful patterns from texts; one of the basic approaches is clustering, and the most widely used clustering algorithm is k-means. This chapter proposes an improvement to the convergence step of the k-means algorithm: the process stops whenever the number of objects that change their assigned cluster in the current iteration is larger than the number that changed in the previous iteration. Experimental results showed a reduction in execution time of up to 93%. Remarkably, better results are generally obtained as the volume of text increases, particularly for texts in big data environments.
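A minimal sketch of that stopping rule inside a plain k-means loop (a hypothetical re-implementation of the criterion as stated, not the authors' code; assumes a float feature matrix X):

```python
import numpy as np

def kmeans_early_stop(X, k, max_iter=100, seed=0):
    """K-means that stops as soon as the number of points changing cluster
    grows compared with the previous iteration (or no point changes at all)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    labels = np.full(len(X), -1)
    prev_changes = np.inf
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        changes = int((new_labels != labels).sum())
        labels = new_labels
        if changes == 0 or changes > prev_changes:
            break  # assignment churn started growing again: stop early
        prev_changes = changes
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

The intuition is that the count of reassigned objects normally shrinks each iteration; once it rises instead, further iterations are unlikely to improve the partition, so the loop can terminate well before standard convergence.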


2014 ◽  
Vol 687-691 ◽  
pp. 1496-1499
Author(s):  
Yong Lin Leng

Partially missing or blurred attribute values make data incomplete during collection. Generally, imputation or discarding methods are used to handle incomplete data before clustering. In this paper, we propose a new similarity metric algorithm based on an incomplete information system. First, the algorithm divides the data set into a complete data set and an incomplete data set; the complete data set is then clustered using the affinity propagation clustering algorithm, and each incomplete record is assigned to the corresponding cluster according to the designed similarity metric. To improve efficiency, a distributed version of the clustering algorithm is designed based on cloud computing technology. Experiments demonstrate that the proposed algorithm can cluster incomplete big data directly and improves both accuracy and effectiveness.
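A sketch of one plausible such metric, a partial distance over jointly observed attributes with NaN marking missing values; this is my assumption of the general approach, not the paper's exact formula:

```python
import numpy as np

def partial_distance(x, y):
    """Euclidean distance over the attributes both records observe,
    rescaled to penalize pairs with little overlap."""
    mask = ~(np.isnan(x) | np.isnan(y))
    if not mask.any():
        return np.inf                         # nothing comparable
    d = np.linalg.norm(x[mask] - y[mask])
    return d * np.sqrt(len(x) / mask.sum())   # scale up for sparse overlap

def assign_incomplete(record, exemplars):
    """Place an incomplete record into the cluster of its nearest exemplar
    (exemplars come from affinity propagation over the complete subset)."""
    return int(np.argmin([partial_distance(record, e) for e in exemplars]))

# Two illustrative exemplars; the record has one missing attribute.
exemplars = [np.array([0.0, 0.0, 0.0]), np.array([5.0, 5.0, 5.0])]
print(assign_incomplete(np.array([4.8, np.nan, 5.1]), exemplars))  # -> 1
```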


Author(s):  
Mohammadhossein Barkhordari ◽  
Mahdi Niamanesh

Because of the high rate of data growth and the need for data analysis, data warehouse management for big data is an important issue. Single-node solutions cannot manage such large amounts of information, so data must be distributed over multiple hardware nodes. Nevertheless, distributing data over nodes means each node may need data from other nodes to execute a query. Data exchange among nodes creates problems such as joins between data segments residing on different nodes, network congestion, and nodes waiting for data reception. In this paper, the Aras method is proposed. It is a MapReduce-based method that places a self-contained data set on each mapper. With this method, each mapper node can execute its query independently, without needing to exchange data with other nodes. This node independence solves the aforementioned data distribution problems. The proposed method has been compared with prominent big data warehouses, and the Aras query execution time was much lower than that of the other methods.
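To illustrate the mapper-independence idea, here is a toy sketch of co-partitioning two tables on the join key so every mapper can join its own slice with no shuffle; table names, fields, and the hash partitioning are illustrative assumptions, not the Aras implementation:

```python
def copartition(orders, customers, n_parts):
    """Hash-partition both tables on the join key so each partition
    holds all rows needed to join locally."""
    parts = [([], []) for _ in range(n_parts)]
    for row in orders:
        parts[hash(row["cust_id"]) % n_parts][0].append(row)
    for row in customers:
        parts[hash(row["cust_id"]) % n_parts][1].append(row)
    return parts

def local_join(part):
    """Each mapper joins only its own slice -- no cross-node data exchange."""
    orders, customers = part
    by_id = {c["cust_id"]: c for c in customers}
    return [{**o, **by_id[o["cust_id"]]} for o in orders if o["cust_id"] in by_id]

orders = [{"cust_id": 1, "amount": 10}, {"cust_id": 2, "amount": 7}]
customers = [{"cust_id": 1, "name": "Ada"}, {"cust_id": 2, "name": "Bob"}]
for part in copartition(orders, customers, n_parts=2):
    print(local_join(part))  # each partition joins independently
```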


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-25
Author(s):  
Xiao-Yan Gao ◽  
Radhya Sahal ◽  
Gui-Xiu Chen ◽  
Mohammed H. Khafagy ◽  
Fatma A. Omara

Multiway join queries incur high-cost I/O operations over large-scale data. Exploiting sharing opportunities among multiple multiway joins can reduce both query execution time and the volume of shuffled intermediate data. Although multiway join optimization has been carried out in MapReduce, platforms with different design principles (i.e., in-memory Big Data platforms such as Flink) have not been considered. To bridge this gap, an end-to-end multiway join system over Flink, called the Join-MOTH (J-MOTH) system, is proposed to exploit sharing at the data, join, and implicit-sort granularities within multiple join queries. For sharing data, our previous work, the Multiquery Optimization using Tuple Size and Histogram (MOTH) system, considers the granularity of data-sharing opportunities among multiple queries. For sharing sorts, our previous work, the Sort-Based Optimizer for Big Data Multiquery (SOOM), considers the implicit sorts among join queries. For sharing joins, additional modules have been tailored to the J-MOTH optimizer to exploit shared pipelined multiway joins among multiple multiway join queries. The experimental evaluation demonstrates that the J-MOTH system outperforms the naive and state-of-the-art techniques by 44% in query execution time on TPC-H queries. The proposed J-MOTH system also reduces intermediate data size by 30% on average over Hadoop-like infrastructures.
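A toy sketch of detecting sharing-join opportunities: group queries by a common (column, column) join pair so that the shared join could be evaluated once and pipelined to every consumer. The query representation and TPC-H-style names are illustrative assumptions, not J-MOTH's internal plan format:

```python
from collections import defaultdict

def shared_join_plans(queries):
    """Return join pairs referenced by more than one query; each such pair
    is a candidate for computing once and sharing the pipelined result."""
    groups = defaultdict(list)
    for q in queries:
        for join in q["joins"]:
            groups[frozenset(join)].append(q["name"])
    return {tuple(sorted(k)): v for k, v in groups.items() if len(v) > 1}

queries = [
    {"name": "Q1", "joins": [("lineitem.orderkey", "orders.orderkey")]},
    {"name": "Q2", "joins": [("lineitem.orderkey", "orders.orderkey"),
                             ("orders.custkey", "customer.custkey")]},
]
print(shared_join_plans(queries))
# {('lineitem.orderkey', 'orders.orderkey'): ['Q1', 'Q2']}
```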


K-means clustering is a very powerful and frequently used clustering algorithm, but it has its own limitations. The prevalent k-means algorithm suffers from inadequacies such as a slow convergence rate and the trap of local optima. Therefore, many swarm-intelligence-based procedures have been combined with k-means for clustering, and their performance, variations, and applications in data grouping have been demonstrated. In this paper we propose a parallel organizing strategy for a KM-MBFO mechanism, implemented in the Hadoop Distributed File System (HDFS), to reduce execution time. The mapper approach produces the population for a given data set for grouping. The Modified Bacterial Foraging Optimization (MBFO) algorithm evaluates the fitness of the population to choose the optimal K values in terms of execution time and classification error. Through simulated test results, we assess the performance of the proposed KM-MBFO scheme.
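A toy stand-in for the fitness-guided search over K, in which each step "tumbles" to a neighbouring K and keeps it only if fitness improves; the SSE-plus-penalty fitness is my simplification (the paper combines execution time and classification error), so this sketches the flavour of MBFO rather than the authors' algorithm:

```python
import numpy as np

def kmeans_sse(X, k, iters=20, seed=0):
    """Plain k-means returning within-cluster sum of squared errors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return ((X - centers[labels]) ** 2).sum()

def fitness(X, k, penalty=5.0):
    """Toy fitness: SSE plus a complexity penalty so larger k is not free."""
    return kmeans_sse(X, k) + penalty * k

def foraging_search_k(X, k_min=2, k_max=10, steps=15, seed=0):
    """Tumble to a neighbouring k; move only toward better fitness."""
    rng = np.random.default_rng(seed)
    k = int(rng.integers(k_min, k_max + 1))
    best = fitness(X, k)
    for _ in range(steps):
        cand = int(np.clip(k + rng.choice([-1, 1]), k_min, k_max))
        f = fitness(X, cand)
        if f < best:                  # richer nutrients: accept the move
            k, best = cand, f
    return k
```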


2016 ◽  
Vol 855 ◽  
pp. 153-158
Author(s):  
Kritwara Rattanaopas ◽  
Sureerat Kaewkeerat ◽  
Yanapat Chuchuen

Big Data is widely used in many organizations nowadays. Hive is an open-source data warehouse system for managing large data sets. It provides a SQL-like interface to Hadoop over the MapReduce framework. Big Data solutions are starting to adopt HiveQL to improve the execution time of relational queries. In this paper, we investigate query execution time by comparing two compression algorithms for the ORC file format: ZLIB and SNAPPY. The results show that ZLIB compresses data by up to 87% compared with uncompressed data, better than SNAPPY's space saving of 79%. However, the key to reducing execution time is MapReduce: query execution time was lowest when the numbers of mappers and data nodes were equal. For example, all query suites on 6 nodes (ZLIB/SNAPPY) with 250 million table rows had execution times quite similar to those on 9 nodes (ZLIB/SNAPPY) with 350 million table rows.
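A minimal sketch of setting up the two ORC codecs compared above, assuming a HiveServer2 instance reachable at localhost:10000 and the PyHive client; the table name and schema are illustrative only (the `"orc.compress"` table property is standard Hive ORC configuration):

```python
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cur = conn.cursor()
for codec in ("ZLIB", "SNAPPY"):
    # Create one copy of the table per codec to compare size and query time.
    cur.execute(f"""
        CREATE TABLE IF NOT EXISTS lineitem_{codec.lower()}
        (id BIGINT, amount DOUBLE)
        STORED AS ORC
        TBLPROPERTIES ("orc.compress"="{codec}")
    """)
```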


2016 ◽  
Vol 2016 ◽  
pp. 1-13 ◽  
Author(s):  
Mao Ye ◽  
Wenfen Liu ◽  
Jianghong Wei ◽  
Xuexian Hu

Because of its positive effect in dealing with the curse of dimensionality in big data, random projection has recently become a popular method for dimensionality reduction. In this paper, a theoretical analysis of the influence of random projection on the variability of a data set and the dependence of its dimensions is proposed. Together with the theoretical analysis, a new fuzzy c-means (FCM) clustering algorithm with random projection is presented. Empirical results verify that the new algorithm not only preserves the accuracy of the original FCM clustering, but is also more efficient than both the original clustering and clustering with singular value decomposition. A new cluster ensemble approach based on FCM clustering with random projection is also proposed. The new aggregation method can efficiently compute the spectral embedding of the data with a cluster-center-based representation, which scales linearly with data size. Experimental results reveal the efficiency, effectiveness, and robustness of our algorithm compared with state-of-the-art methods.
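A minimal sketch of the combination (Gaussian random projection followed by a plain FCM loop); the projection scaling and all sizes are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def fcm(X, c, m=2.0, iters=50, seed=0):
    """Minimal fuzzy c-means: alternate membership and center updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                  # memberships sum to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None] - centers[:, None], axis=2) + 1e-10
        U = d ** (-2.0 / (m - 1))                       # standard FCM update
        U /= U.sum(axis=0)
    return U.argmax(axis=0)

# Project n x d data down to n x k with a Gaussian matrix, then cluster
# the low-dimensional sketch instead of the original points.
rng = np.random.default_rng(1)
X = rng.random((1000, 200))
k = 20
R = rng.normal(size=(200, k)) / np.sqrt(k)              # JL-style projection
labels = fcm(X @ R, c=3)
print(np.bincount(labels))                              # cluster sizes
```

Clustering the k-dimensional sketch instead of the 200-dimensional data is what yields the efficiency gain the abstract reports, while the Johnson-Lindenstrauss-type argument keeps pairwise distances, and hence the FCM memberships, approximately intact.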

