ANN-inspired Straggler Map Reduce Detection in Big Data Processing

CONVERTER ◽  
2021 ◽  
pp. 116-127
Author(s):  
Ajay Bansal, Manmohan Sharma, Ashu Gupta

One of the most challenging aspects of using MapReduce to parallelize and distribute large-scale data processing is detecting straggler tasks, i.e., identifying tasks that are still running on weak nodes. The total computation time is the sum of the execution times of the two stages in the Map phase (copy, combine) and the three stages in the Reduce phase (shuffle, sort, and reduce). The main aim of this paper is to estimate the execution time accurately in each location. The proposed approach uses a backpropagation neural network on Hadoop to detect straggler tasks and calculate the remaining task execution time, which is crucial for straggler identification. A comparative analysis is performed against efficient models in this domain, such as LATE and ESAMR, and against the real remaining time for the WordCount and Sort benchmarks. It was found that the proposed model is capable of detecting straggler tasks by accurately estimating execution time, and it also helps reduce the time taken to complete a task.
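
For context, a minimal sketch of the LATE-style remaining-time heuristic that the paper uses as a comparison baseline is given below; the proposed method replaces this heuristic with a backpropagation neural network, which is not reproduced here. The function names, the straggler threshold of 1.25, and the numbers in the example are illustrative assumptions, not values from the paper.

# Baseline heuristic (LATE-style): extrapolate remaining time from the
# task's progress score and elapsed time, then flag tasks whose estimate
# is well above the average as straggler candidates.
def late_remaining_time(progress_score: float, elapsed_seconds: float) -> float:
    """Estimate the seconds left for a task from its progress score (0..1)."""
    if progress_score <= 0.0:
        return float("inf")            # no progress yet, nothing to extrapolate
    progress_rate = progress_score / elapsed_seconds
    return (1.0 - progress_score) / progress_rate

def is_straggler(remaining: float, all_remaining: list, threshold: float = 1.25) -> bool:
    """Flag a task whose estimated remaining time far exceeds the average."""
    average = sum(all_remaining) / len(all_remaining)
    return remaining > threshold * average

# Example: a reduce task that is 40% done after 120 s is estimated to need 180 s more.
print(late_remaining_time(0.4, 120.0))   # -> 180.0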

Author(s):  
Karthikeyani Visalakshi N. ◽  
Shanthi S. ◽  
Lakshmi K.

Cluster analysis is a prominent data mining technique in knowledge discovery that uncovers hidden patterns in data. K-Means, K-Modes and K-Prototypes are partition-based clustering algorithms that select their initial centroids randomly; because of this random selection, they tend to converge to locally optimal solutions. To address this issue, the Crow Search algorithm is combined with these algorithms to obtain globally optimal solutions. With the advances in information technology, data volumes have grown drastically from terabytes to petabytes. To make the proposed algorithms suitable for such voluminous data, they are implemented in parallel with the Hadoop MapReduce framework. The proposed algorithms are evaluated on large-scale data, and the results are compared in terms of cluster evaluation measures and computation time for varying numbers of nodes.
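
To make the parallel formulation concrete, the sketch below shows one K-Means iteration expressed as map and reduce steps in plain Python; it is an illustration under assumed data, not the authors' Hadoop implementation, and it omits the Crow Search seeding of the initial centroids that the paper adds on top of this scheme.

from collections import defaultdict
import math

def kmeans_map(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances)), point

def kmeans_reduce(assignments):
    """Reduce step: recompute each centroid as the mean of its assigned points."""
    groups = defaultdict(list)
    for idx, point in assignments:
        groups[idx].append(point)
    return {idx: tuple(sum(dim) / len(pts) for dim in zip(*pts))
            for idx, pts in groups.items()}

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.0), (9.0, 9.0)]        # randomly chosen seeds in the baseline
print(kmeans_reduce(kmeans_map(p, centroids) for p in points))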


2021 ◽  
pp. 1-18
Author(s):  
Salahaldeen Rababa ◽  
Amer Al-Badarneh

Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for processing large-scale data. However, it has limitations when processing heterogeneous datasets because of the large number of redundant intermediate records transferred over the network. Several filtering techniques have been developed to improve join performance, but they require multiple MapReduce jobs to process the input datasets. To address this issue, this paper presents adaptive filter-based join algorithms. Specifically, three join algorithms are introduced that perform filter creation and redundant record elimination within a single MapReduce job. A cost analysis of the introduced join algorithms shows that the I/O cost is reduced compared to the state-of-the-art filter-based join algorithms. The performance of the join algorithms was evaluated in terms of the total execution time and the total amount of I/O data transferred. The experimental results show that the adaptive Bloom join, semi-adaptive intersection Bloom join, and adaptive intersection Bloom join decrease the total execution time by 30%, 25%, and 35%, respectively, and reduce the total amount of I/O data transferred by 18%, 25%, and 50%, respectively.
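
The general idea behind filter-based joins, sketched below in plain Python, is to build a Bloom filter over the join keys of one dataset and use it to discard non-joining records of the other dataset before they reach the shuffle and join phase; the adaptive, single-job variants proposed in the paper build such filters on the fly inside one MapReduce job, which is not reproduced here. All dataset contents and parameter values are illustrative assumptions.

import hashlib

class BloomFilter:
    """A small, self-contained Bloom filter for illustration only."""
    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))

left = [("u1", "Alice"), ("u2", "Bob")]                           # smaller dataset
right = [("u1", "order-9"), ("u7", "order-3"), ("u2", "order-5")]

bloom = BloomFilter()
for key, _ in left:
    bloom.add(key)

# Only records whose key may join survive to the shuffle/join phase,
# which is what cuts the redundant intermediate I/O.
survivors = [rec for rec in right if bloom.might_contain(rec[0])]
print(survivors)   # ("u7", ...) is dropped, barring false positives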


Author(s):  
Anwar H. Katrawi ◽  
Rosni Abdullah ◽  
Mohammed Anbar ◽  
Ammar Kamal Abasi

Using MapReduce in Hadoop helps lower the execution time and power consumption of large-scale data processing. However, job processing can be delayed when tasks are assigned to weak or congested machines; these "straggler tasks" increase execution time and power consumption, and therefore raise costs and degrade the performance of computing systems. This research proposes a hybrid MapReduce framework referred to as the combinatory late-machine (CLM) framework. Implementing this framework enables early and timely detection and identification of stragglers, so that prompt, appropriate and effective actions can be taken.


2021 ◽  
Author(s):  
Daniel Silver ◽  
Thiago H Silva

Why some neighbourhoods change over time while others retain their identity remains an open question. Several attempts have been made to answer this question, with a family of models emerging as a result. However, empirically evaluating neighbourhood evolution models is a challenging task, because most require information that is difficult to obtain from traditional sources. For this reason, researchers have turned to new datasets, such as census microdata, Twitter, and Yelp. In this study, we articulate a functional model of neighbourhood change and continuity, adapted from a classical functionalist model proposed by Stinchcombe in 1968. We argue this model provides a relatively simple way to capture key aspects of the complex causal structure of neighbourhood change that are implicit in much neighbourhood change research but rarely formulated explicitly. We demonstrate how to assess the proposed model empirically using large-scale data from Yelp.com. Our results indicate that our approach can help illuminate the nature of neighbourhood change and can be useful in a range of applications.


2018 ◽  
Vol 5 (2) ◽  
pp. 1-20
Author(s):  
Sudhansu Shekhar Patra ◽  
Veena Goswami

Due to recent advancements, virtualization technology is now an up-and-coming field and an increasingly appealing area of internet technology. The rapidly growing demand for computational power from scientific, business, and web applications has led to the creation of large-scale data centers, which consume enormous amounts of electrical power. In this article, the authors study energy-saving methods based on consolidation and on switching off virtual machines that are not in use. Under this policy, c virtual machines continue serving customers until the number of idle servers reaches the threshold d, at which point those d idle servers take a synchronous vacation simultaneously; otherwise, the servers keep serving customers. Numerical results are provided to demonstrate the applicability of the proposed model to data center management and, in particular, to quantify the theoretical tradeoff between the conflicting aims of energy efficiency and QoS.
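
A minimal sketch of the threshold rule described above is given below; the variable names and the example values are illustrative assumptions, and the paper's queueing analysis is not reproduced.

def machines_to_power_down(total_vms: int, busy_vms: int, threshold_d: int) -> int:
    """d-threshold policy: once d of the c machines are idle, park those d together."""
    idle = total_vms - busy_vms
    return threshold_d if idle >= threshold_d else 0

# Example: with c = 10 virtual machines, 5 busy and threshold d = 4,
# 4 idle machines take a synchronous vacation; with only 2 idle, none do.
print(machines_to_power_down(total_vms=10, busy_vms=5, threshold_d=4))   # -> 4
print(machines_to_power_down(total_vms=10, busy_vms=8, threshold_d=4))   # -> 0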


2019 ◽  
Vol 2019 ◽  
pp. 1-7
Author(s):  
Mohammad Shabaz ◽  
Ashok Kumar

Sorting is one of the fundamental operations on data structures, defined as arranging data or records in a particular logical order. A number of algorithms have been developed for sorting data, with the aim of optimizing efficiency and complexity, and work on new sorting approaches is still ongoing. With the rise of big data generation, the concept of big numbers has come into existence. To sort thousands of records, whether already sorted or not, traditional sorting approaches can be used; in those cases the differences in execution time are minute and can be ignored. But when the data are very large, where the execution or processing time for billions or trillions of records is substantial, the complexity can no longer be ignored and an optimized sorting approach is required. SA sorting is one such approach, developed to check sorted big numbers; it works better on sorted input than quick sort and many other algorithms, and it can also be used to sort unsorted records.
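
The SA sorting algorithm itself is not detailed in this abstract; the sketch below only illustrates, under that caveat, why a cheap sortedness check pays off on large already-sorted inputs, which is the advantage attributed to SA sorting over quick sort.

def is_sorted(records) -> bool:
    """One linear scan; far cheaper than re-sorting billions of records."""
    return all(a <= b for a, b in zip(records, records[1:]))

def adaptive_sort(records):
    """Skip the comparison sort entirely when the input is already in order."""
    return list(records) if is_sorted(records) else sorted(records)

print(adaptive_sort([1, 2, 3, 4, 5]))    # detected as sorted, returned as-is
print(adaptive_sort([5, 1, 4, 2, 3]))    # falls back to a full sort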


2009 ◽  
Vol 28 (11) ◽  
pp. 2737-2740
Author(s):  
Xiao ZHANG ◽  
Shan WANG ◽  
Na LIAN
