A gray-box modeling methodology for runtime prediction of Apache Spark jobs

2020 ◽  
Vol 38 (4) ◽  
pp. 819-839
Author(s):  
Hani Al-Sayeh ◽  
Stefan Hagedorn ◽  
Kai-Uwe Sattler

Abstract
Apache Spark jobs are often characterized by processing huge data sets and, therefore, require runtimes in the range of minutes to hours. Thus, being able to predict the runtime of such jobs would be useful not only to know when a job will finish, but also for scheduling purposes, for estimating the monetary costs of a cloud deployment, or for determining an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than for standard database queries: cluster configuration and parameters have a significant performance impact, and jobs usually contain a lot of user-defined code, making it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps: first, a white-box model for predicting the cardinalities of the input RDDs of each operator is built based on prior knowledge about the behavior and on application parameters such as applied filters, number of iterations, etc. In the second step, a black-box model for each task, constructed by monitoring runtime metrics while varying allocated resources and input RDD cardinalities, is used. We further show how to use this gray-box approach not only for predicting the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. Our methodology is validated with an experimental evaluation showing highly accurate prediction of actual job runtimes and a performance improvement when intermediate results can be reused.
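The two-step idea described in the abstract — a white-box cardinality estimate feeding a black-box per-task cost model — can be sketched in a few lines. This is a minimal illustration only: the function names, the linear cost form, and the coefficients are hypothetical placeholders, not the paper's actual models.

```python
def estimate_cardinality(input_rows, filter_selectivity, iterations=1):
    """White-box step: predict an operator's input RDD cardinality from
    known application parameters (here, a filter's selectivity)."""
    return int(input_rows * filter_selectivity) * iterations

def task_runtime_model(cardinality, cores, a=2e-6, b=0.5):
    """Black-box step: a per-task runtime model fitted from monitored
    metrics; runtime grows with cardinality and shrinks with allocated
    cores. Coefficients a and b are made up for illustration."""
    return a * cardinality / cores + b

def predict_job_runtime(stages, cores):
    """Combine both steps: sum the task model over predicted cardinalities."""
    return sum(task_runtime_model(estimate_cardinality(rows, sel), cores)
               for rows, sel in stages)

# Two hypothetical stages: (input rows, filter selectivity)
stages = [(10_000_000, 0.1), (1_000_000, 0.5)]
print(round(predict_job_runtime(stages, cores=8), 3))  # -> 1.375
```

In this toy form, the white-box step supplies the cardinalities that the fitted black-box model cannot see from monitoring alone, which is the essence of the gray-box combination.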

2021 ◽  
Vol 14 (11) ◽  
pp. 2369-2382
Author(s):  
Monica Chiosa ◽  
Thomas B. Preußer ◽  
Gustavo Alonso

Data analysts often need to characterize a data stream as a first step to its further processing. Some of the initial insights to be gained include, e.g., the cardinality of the data set and its frequency distribution. Such information is typically extracted by using sketch algorithms, now widely employed to process very large data sets in manageable space and in a single pass over the data. Often, analysts need more than one parameter to characterize the stream. However, computing multiple sketches becomes expensive even when using high-end CPUs. Exploiting the increasing adoption of hardware accelerators, this paper proposes SKT, an FPGA-based accelerator that can compute several sketches along with basic statistics (average, max, min, etc.) in a single pass over the data. SKT has been designed to characterize a data set by calculating its cardinality, its second frequency moment, and its frequency distribution. The design processes data streams coming either from PCIe or TCP/IP, and it is built to fit emerging cloud service architectures, such as Microsoft's Catapult or Amazon's AQUA. The paper explores the trade-offs of designing sketch algorithms on a spatial architecture and how to combine several sketch algorithms into a single design. The empirical evaluation shows that SKT on an FPGA offers a significant performance gain over high-end, server-class CPUs.
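The kind of fused single-pass pipeline SKT implements in hardware can be mimicked in software. The sketch below is illustrative only: it combines basic statistics with a Count-Min sketch (a standard frequency sketch), whereas the actual FPGA design also computes cardinality and the second frequency moment; all names and parameters here are assumptions.

```python
import hashlib

class CountMin:
    """Count-Min sketch: approximate per-item frequencies in fixed,
    sublinear space, built in a single pass over the stream."""
    def __init__(self, depth=4, width=256):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _idx(self, item, row):
        # One independent hash per row, derived from a salted SHA-256.
        h = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:4], "big") % self.width

    def add(self, item):
        for r in range(self.depth):
            self.table[r][self._idx(item, r)] += 1

    def estimate(self, item):
        # Never underestimates: takes the minimum over the rows.
        return min(self.table[r][self._idx(item, r)] for r in range(self.depth))

def characterize(stream):
    """Single pass computing basic statistics plus a frequency sketch."""
    cm, n, lo, hi, total = CountMin(), 0, None, None, 0
    for x in stream:
        cm.add(x); n += 1; total += x
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
    return {"count": n, "min": lo, "max": hi, "avg": total / n, "sketch": cm}

stats = characterize([1, 2, 2, 3, 3, 3])
print(stats["count"], stats["min"], stats["max"])  # 6 1 3
print(stats["sketch"].estimate(3))  # >= 3, close to 3 with high probability
```

The point of fusing everything into one loop — and, on the FPGA, into one spatial pipeline — is that the stream is consumed exactly once, however many parameters are extracted.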


Author(s):  
Divya Dasagrandhi ◽  
Arul Salomee Kamalabai Ravindran ◽  
Anusuyadevi Muthuswamy ◽  
Jayachandran K. S.

Understanding the mechanisms of a disease is highly complicated due to the complex pathways involved in disease progression. Despite several decades of research, the occurrence and prognosis of diseases are not completely understood, even with high-throughput experiments like DNA microarrays and next-generation sequencing. This is due to challenges in the analysis of huge data sets. Systems biology is one of the major divisions of bioinformatics and has provided cutting-edge techniques for the better understanding of these pathways. Construction of a protein-protein interaction network (PPIN) guides modern scientists in identifying vital proteins, which facilitates the identification of new drug targets and associated proteins. The chapter is focused on PPI databases, the construction of PPINs, and their analysis.
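A PPIN of the kind the chapter describes can be represented as a simple undirected graph, with vital (hub) proteins identified by their degree. The sketch below is a minimal illustration; the protein names are placeholders and the interaction pairs are hypothetical, standing in for an export from a PPI database such as STRING or BioGRID.

```python
from collections import defaultdict

# Hypothetical interaction pairs (placeholder protein names).
interactions = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

def build_ppin(pairs):
    """Construct an undirected protein-protein interaction network
    as an adjacency mapping: protein -> set of interaction partners."""
    net = defaultdict(set)
    for p, q in pairs:
        net[p].add(q)
        net[q].add(p)
    return net

def hubs(net, top=2):
    """Rank proteins by degree; high-degree nodes are candidate
    'vital' proteins and potential drug targets in PPIN analysis."""
    return sorted(net, key=lambda p: len(net[p]), reverse=True)[:top]

net = build_ppin(interactions)
print(hubs(net))  # "A" ranks first: it has the highest degree (3)
```

Real PPIN analysis typically uses richer centrality measures (betweenness, closeness) and dedicated tools, but degree ranking over an adjacency structure is the core of hub-protein identification.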


Author(s):  
Andrew Stranieri ◽  
Venki Balasubramanian

Remote patient monitoring involves the collection of data from wearable sensors that typically requires analysis in real time. The real-time analysis of data streaming continuously to a server challenges data mining algorithms that have mostly been developed for static data residing in central repositories. Remote patient monitoring also generates huge data sets that present storage and management problems. Although virtual records of every health event throughout an individual's lifespan, known as the electronic health record, are rapidly emerging, few electronic records accommodate data from continuous remote patient monitoring. These factors combine to make data analytics with continuous patient data very challenging. In this chapter, the benefits for data analytics inherent in the use of standards for clinical concepts in remote patient monitoring are presented. The openEHR standard, which describes the way in which concepts are used in clinical practice, is well suited to be adopted as the standard for recording meta-data about remote monitoring. The claim is advanced that this is likely to facilitate meaningful real-time analyses with big remote patient monitoring data. The point is made by drawing on a case study involving the transmission of patient vital-sign data collected from wearable sensors in an Indian hospital.


Author(s):  
Roberto Marmo

As a consequence of the expansion of modern technology, the number and variety of fraud scenarios are increasing dramatically. The resulting reputational damage and financial losses are primary motivations for technologies and methodologies for fraud detection, which have been applied successfully in several economic activities. Detection involves monitoring the behavior of users based on huge data sets such as logged data and user behavior. The aim of this contribution is to show some data mining techniques for fraud detection and prevention, with applications in credit cards and telecommunications, within a business context of mining the data to achieve higher cost savings, and also in the interest of determining potential legal evidence. The problem is very difficult because fraud takes many different forms and fraudsters are adaptive, so they will usually look for ways to circumvent every security measure.


1996 ◽  
Vol 160 ◽  
pp. 21-22
Author(s):  
R. Wietfeldt ◽  
W. Van Straten ◽  
D. Del Rizzo ◽  
N. Bartel ◽  
W. Cannon ◽  
...  

Abstract
The phase-coherent recording of pulsar data and subsequent software dispersion removal provide a flexible way to reach the limits of high time resolution, useful for more precise pulse timing and the study of fast signal fluctuations within a pulse. Because of the huge data rate and the lack of adequate recording and computing capabilities, this technique has mostly been used only for small pulsar data sets. In recent years, however, the development of very capable, reasonably inexpensive high-speed recording systems and computers has made feasible the notion of pulsar baseband recording and subsequent processing with a workstation or computer. In this paper we discuss the development of a phase-coherent baseband processing system for radio pulsar observations. This system is based on the S2 VLBI recorder developed at ISTS/York University in Toronto, Canada. We present preliminary results for data from the Vela pulsar, obtained at Parkes, Australia, and processed at ISTS/York University, and discuss plans for future developments.


2017 ◽  
Vol 8 (2) ◽  
pp. 30-43
Author(s):  
Mrutyunjaya Panda

Big Data, due to its complicated and diverse nature, poses many challenges for extracting meaningful observations. This calls for smart and efficient algorithms that can deal with the computational complexity and memory constraints arising from iterative behavior. The issue may be addressed with parallel computing techniques, where one or more machines work simultaneously, dividing the problem into sub-problems and assigning private memory to each sub-problem. Clustering analysis has proved useful in handling such huge data in the recent past. Although many investigations into Big Data analysis are ongoing, here Canopy and K-Means++ clustering are used to process large-scale data in a shorter amount of time and without memory constraints. To assess the suitability of the approach, several data sets are considered, ranging from small to very large ones with diverse fields of application. The experimental results indicate that the proposed approach is fast and accurate.
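Of the two algorithms named above, K-Means++ differs from plain K-Means only in how the initial centres are seeded, and that step is easy to sketch. The following is a minimal one-dimensional illustration (the abstract's actual pipeline, including Canopy pre-clustering, is not reproduced here; all names are placeholders).

```python
import random

def kmeans_pp_seeds(points, k, rng):
    """K-Means++ seeding: pick the first centre uniformly at random,
    then pick each subsequent centre with probability proportional to
    its squared distance from the nearest centre chosen so far. This
    spreads the initial centres out, which typically speeds up
    convergence of the subsequent K-Means iterations."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        # Squared distance of every point to its nearest chosen centre.
        d2 = [min((p - c) ** 2 for c in centres) for p in points]
        total = sum(d2)
        # Weighted random draw proportional to d2.
        r, acc = rng.random() * total, 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centres.append(p)
                break
    return centres

data = [1.0, 1.1, 1.2, 9.0, 9.1, 9.2]  # two well-separated groups
seeds = kmeans_pp_seeds(data, 2, random.Random(0))
print(sorted(seeds))
```

Because already-chosen centres have zero weight, they are never drawn again, and with well-separated groups the two seeds land in different groups with high probability.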


Machine learning is a technology which, with accumulated data, provides better decisions for future applications. It is the scientific study of algorithms implemented efficiently to perform a specific task without using explicit instructions. It may also be viewed as a subset of artificial intelligence, linked with the ability to automatically learn and improve from experience without being explicitly programmed. Its primary intention is to allow computers to learn automatically and produce more accurate results in order to identify profitable opportunities. Combining machine learning with AI and cognitive technologies can make it even more effective at processing large volumes of information without human intervention or assistance, adjusting actions accordingly. It may also be linked to algorithm-driven study aimed at improving the performance of tasks. In such a scenario, the techniques can be applied to judge and predict from large data sets. The paper concerns the mechanism of supervised learning in database systems, which would be self-driven as well as secure. A case study of an organization dealing with student loans is also presented. The paper ends with a discussion, future directions, and conclusions.


2012 ◽  
Vol 31 (3) ◽  
pp. 83-87 ◽  
Author(s):  
Birutė Ruzgienė ◽  
Wolfgang Förstner

Up-to-date digital photogrammetry involves operations on huge data sets, and with classical image processing procedures it can be time-consuming to find the best solution. One of the key tasks is to detect outliers in given data, e.g. for curve fitting or image matching. The problem is hard, as the number of outliers is usually large, possibly larger than 50%, so powerful estimation techniques are needed. We demonstrate one of these techniques, namely Random Sample Consensus (RANSAC), for fitting a model to sample data, especially for fitting a straight line through a set of given points. Experiments with up to 80% outliers prove the efficiency of RANSAC. The results are representative for image analysis in digital photogrammetry.
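The line-fitting use of RANSAC described above can be sketched directly: repeatedly fit a line to a minimal random sample of two points and keep the model that explains the most inliers. This is a minimal illustration with made-up data (40% outliers), not the paper's experimental setup.

```python
import random

def fit_line(p, q):
    """Slope and intercept of the line through two sample points."""
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)
    return m, y1 - m * x1

def ransac_line(points, iters=200, tol=0.5, rng=None):
    """RANSAC: fit a line to random minimal samples and keep the
    hypothesis with the largest consensus set (residual below tol)."""
    rng = rng or random.Random(0)
    best, best_inliers = None, []
    for _ in range(iters):
        p, q = rng.sample(points, 2)
        if p[0] == q[0]:
            continue  # vertical sample; skip for this simple y = mx + b model
        m, b = fit_line(p, q)
        inliers = [(x, y) for x, y in points if abs(y - (m * x + b)) < tol]
        if len(inliers) > len(best_inliers):
            best, best_inliers = (m, b), inliers
    return best, best_inliers

# Six points exactly on y = 2x + 1, plus four gross outliers.
pts = [(x, 2 * x + 1) for x in range(6)] + [(1, 40), (2, -30), (3, 55), (4, -20)]
(m, b), inliers = ransac_line(pts)
print(m, b, len(inliers))  # with this seed: 2.0 1.0 6
```

Because any all-inlier sample reproduces the true line exactly while mixed samples gather almost no support, the consensus count cleanly separates the correct hypothesis from the contaminated ones — the property that lets RANSAC tolerate outlier rates above 50%.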

