Predicting SQL Query Execution Time for Large Data Volume

Author(s):  
Rekha Singhal ◽  
Manoj Nambiar

Author(s):  
Aleksey Burdakov ◽  
Viktoria Proletarskaya ◽  
Andrey Ploutenko ◽  
Oleg Ermakov ◽  
Uriy Grigorev

2018 ◽  
Vol 10 (1) ◽  
Author(s):  
Aaron Kite-Powell ◽  
Michael Coletta ◽  
Jamie Smimble

Objective: The objective of this work is to describe the use and performance of the NSSP ESSENCE system by analyzing the structured query language (SQL) logs generated by users of the National Syndromic Surveillance Program’s (NSSP) Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE).

Introduction: As system users develop queries within ESSENCE, they step through the user interface to select the data sources and parameters needed for their query. They then select from the available output options (e.g., time series, table builder, data details). These activities execute a SQL query on the database, the majority of which are saved in a log so that system developers can troubleshoot problems. Secondarily, these data can be used as a form of web analytics to describe user query choices, query volume, and query execution time, and to develop an understanding of ESSENCE query patterns.

Methods: ESSENCE SQL query logs were extracted from April 1, 2016 to August 23rd, 2017. Overall query volume was assessed by summarizing the volume of queries over time (e.g., by hour, day, and week) and by Site. To better understand system performance, the mean, median, and maximum query execution times were summarized over time and by Site. SQL query text was parsed so that we could isolate 1) syndromes queried, 2) sub-syndromes queried, 3) keyword categories queried, and 4) free-text query terms used. Syndromes, sub-syndromes, and keyword categories were tabulated in total and by Site. Frequencies of free-text query terms were analyzed using n-grams, word clouds, and term co-occurrence relationships. Term co-occurrence network graphs were used to visualize the structure and relationships among terms.

Results: There were a total of 354,101 SQL queries generated by users of ESSENCE between April 1, 2016 and August 23rd, 2017. Over this entire time period there was a weekly mean of 4,785 SQL queries performed by users. Restricting to 2017 data through August 23rd, this figure increases to a mean of 7,618 SQL queries per week, and since May 2017 the mean number of SQL queries has increased to 10,485 per week. The maximum number of user-generated SQL queries in a week was 29,173. The mean, median, and maximum query execution times for all data were 0.61 minutes, 0 minutes, and 365 minutes, respectively. When looking only at queries with a free-text component, the mean query execution time increases slightly to 0.94 minutes, though the median remains 0 minutes. The peak usage period, based on the number of SQL queries performed, is between 12:00 pm and 3:00 pm EST.

Conclusions: The use of NSSP ESSENCE has grown since implementation. This is the first time the ESSENCE system has been used at a national level with this volume of data and number of users. Our focus to date has been on successfully onboarding new Sites so that they can benefit from the available tools, providing trainings to new users, and optimizing ESSENCE performance. Routine analysis of the ESSENCE SQL logs can assist us in understanding how the system is being used, how well it is performing, and in evaluating our system optimization efforts.
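The log-analysis workflow described above (weekly query volume, execution-time summaries, and free-text term extraction) can be sketched in a few lines of Python. The column names (start_time, duration_min, query_text) and the LIKE-pattern heuristic are illustrative assumptions, not the actual ESSENCE log schema.

```python
# Minimal sketch of SQL-log summarisation; column names are hypothetical.
import re
from collections import Counter

import pandas as pd


def summarize_query_log(log: pd.DataFrame) -> dict:
    """Summarise query volume, execution time, and free-text terms from a parsed SQL log."""
    log = log.copy()
    log["start_time"] = pd.to_datetime(log["start_time"])

    weekly_volume = log.resample("W", on="start_time").size()
    exec_stats = log["duration_min"].agg(["mean", "median", "max"])

    # Crude extraction of free-text terms from LIKE '%term%' predicates.
    terms = Counter()
    for sql in log["query_text"]:
        for match in re.findall(r"LIKE\s+'%([^%']+)%'", sql, flags=re.IGNORECASE):
            terms[match.lower()] += 1

    return {
        "weekly_volume": weekly_volume,
        "exec_time_stats": exec_stats,
        "top_free_text_terms": terms.most_common(20),
    }
```

N-gram counts, word clouds, and term co-occurrence networks would then be built from the extracted term frequencies.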


2021 ◽  
Vol 17 (2) ◽  
pp. 1-45
Author(s):  
Cheng Pan ◽  
Xiaolin Wang ◽  
Yingwei Luo ◽  
Zhenlin Wang

Due to large data volume and low latency requirements of modern web services, the use of an in-memory key-value (KV) cache often becomes an inevitable choice (e.g., Redis and Memcached). The in-memory cache holds hot data, reduces request latency, and alleviates the load on background databases. Inheriting from the traditional hardware cache design, many existing KV cache systems still use recency-based cache replacement algorithms, e.g., least recently used or its approximations. However, the diversity of miss penalty distinguishes a KV cache from a hardware cache. Inadequate consideration of penalty can substantially compromise space utilization and request service time. KV accesses also demonstrate locality, which needs to be coordinated with miss penalty to guide cache management. In this article, we first discuss how to enhance the existing cache model, the Average Eviction Time model, so that it can adapt to modeling a KV cache. After that, we apply the model to Redis and propose pRedis, Penalty- and Locality-aware Memory Allocation in Redis, which synthesizes data locality and miss penalty, in a quantitative manner, to guide memory allocation and replacement in Redis. At the same time, we also explore the diurnal behavior of a KV store and exploit long-term reuse. We replace the original passive eviction mechanism with an automatic dump/load mechanism, to smooth the transition between access peaks and valleys. Our evaluation shows that pRedis effectively reduces the average and tail access latency with minimal time and space overhead. For both real-world and synthetic workloads, our approach delivers an average of 14.0%∼52.3% latency reduction over a state-of-the-art penalty-aware cache management scheme, Hyperbolic Caching (HC), and shows more quantitative predictability of performance. Moreover, we can obtain even lower average latency (1.1%∼5.5%) when dynamically switching policies between pRedis and HC.
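As a rough illustration of penalty- and locality-aware replacement (not the authors' pRedis implementation, which builds on the Average Eviction Time model inside Redis), the sketch below ranks eviction candidates by a score that combines access recency with a per-key miss penalty, so that cold keys whose misses are cheap to service are evicted first.

```python
# Illustrative sketch only (not the authors' pRedis code): rank eviction
# candidates by a score that combines an access-recency proxy for locality
# with a per-key miss penalty (e.g., the measured backend fetch time).
import time
from dataclasses import dataclass, field


@dataclass
class KeyStats:
    miss_penalty_ms: float = 1.0  # cost of re-fetching the value on a miss
    last_access: float = field(default_factory=time.monotonic)


class PenaltyAwareCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store: dict[str, object] = {}
        self.stats: dict[str, KeyStats] = {}

    def _eviction_score(self, key: str) -> float:
        s = self.stats[key]
        age = time.monotonic() - s.last_access  # larger age => colder key
        # Low score = good victim: a cold key whose miss is cheap to service.
        return s.miss_penalty_ms / (age + 1e-9)

    def put(self, key: str, value: object, miss_penalty_ms: float) -> None:
        if key not in self.store and len(self.store) >= self.capacity:
            victim = min(self.store, key=self._eviction_score)
            del self.store[victim]
            del self.stats[victim]
        self.store[key] = value
        self.stats[key] = KeyStats(miss_penalty_ms=miss_penalty_ms)

    def get(self, key: str):
        if key in self.store:
            self.stats[key].last_access = time.monotonic()
            return self.store[key]
        return None  # a real deployment would fetch from the backing database
```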


2018 ◽  
Vol 4 (12) ◽  
pp. 142 ◽  
Author(s):  
Hongda Shen ◽  
Zhuocheng Jiang ◽  
W. Pan

Hyperspectral imaging (HSI) technology has been used for various remote sensing applications due to its excellent capability of monitoring regions-of-interest over a period of time. However, the large data volume of four-dimensional multitemporal hyperspectral imagery demands efficient data compression techniques. While conventional 3D hyperspectral data compression methods exploit only spatial and spectral correlations, we propose a simple yet effective predictive lossless compression algorithm that can achieve significant gains in compression efficiency by also taking into account the temporal correlations inherent in the multitemporal data. We present an information theoretic analysis to estimate the potential compression performance gain with varying configurations of context vectors. Extensive simulation results demonstrate the effectiveness of the proposed algorithm. We also provide in-depth discussions on how to construct the context vectors in the prediction model for both multitemporal HSI and conventional 3D HSI data.
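The core idea, prediction from a context vector that also includes the co-located sample in the previous acquisition, can be illustrated as follows. This is a simplified sketch with fixed, illustrative weights and wrap-around borders, not the paper's adaptive predictor or its entropy coder.

```python
# Simplified sketch (not the paper's exact predictor): predict each sample from
# spatial, spectral, and temporal neighbours and keep the integer residuals,
# which suffice for lossless reconstruction once entropy coded.
import numpy as np


def temporal_context_residuals(cube_t: np.ndarray, cube_prev: np.ndarray) -> np.ndarray:
    """cube_t, cube_prev: (bands, rows, cols) integer frames from times t and t-1."""
    left = np.roll(cube_t, 1, axis=2)       # spatial neighbour (left pixel)
    prev_band = np.roll(cube_t, 1, axis=0)  # spectral neighbour (previous band)
    temporal = cube_prev                    # same pixel, previous acquisition
    w = (0.3, 0.3, 0.4)                     # illustrative fixed weights; a real coder adapts them
    pred = (w[0] * left + w[1] * prev_band + w[2] * temporal).round().astype(np.int64)
    return cube_t.astype(np.int64) - pred   # residuals to be entropy coded
```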


2021 ◽  
Author(s):  
Rens Hofman ◽  
Joern Kummerow ◽  
Simone Cesca ◽  
Joachim Wassermann ◽  
Thomas Plenefisch ◽  
...  

The AlpArray seismological experiment is an international and interdisciplinary project to advance our understanding of geophysical processes in the greater Alpine region. The heart of the project is a large seismological array that covers the mountain range and its surrounding areas. To understand how the Alps and their neighbouring mountain belts evolved through time, we can only study their current structure and processes. The Eastern Alps are of prime interest since they currently show the highest crustal deformation rates. A key question is how these surface processes are linked to deeper structures. The Swath-D network is an array of temporary seismological stations, complementary to the AlpArray network, located in the Eastern Alps. This creates a unique opportunity to investigate high-resolution seismicity on a local scale.

In this study, a combination of waveform-based detection methods was used to find small earthquakes in the large data volume of the Swath-D network. Methods were developed to locate the seismic events using semi-automatic picks and to estimate event magnitudes. We present an overview of the methods and workflow, as well as a preliminary overview of the seismicity in the Eastern Alps.
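The study combines several waveform-based detection methods; as one hedged illustration of this class of detectors, the snippet below runs a classic STA/LTA trigger on a single continuous trace with ObsPy. The file name and trigger parameters are placeholders, not values from the study.

```python
# One common building block for detecting small events in large continuous
# datasets: a classic STA/LTA trigger (illustrative parameters only).
from obspy import read
from obspy.signal.trigger import classic_sta_lta, trigger_onset

st = read("swathd_station_day.mseed")  # hypothetical daily file from one station
tr = st[0]
df = tr.stats.sampling_rate

# 1 s short-term and 30 s long-term averaging windows (illustrative choices).
cft = classic_sta_lta(tr.data, int(1 * df), int(30 * df))
onsets = trigger_onset(cft, 3.5, 1.0)  # on/off thresholds are tuning choices

for on, off in onsets:
    print("candidate event at", tr.stats.starttime + on / df)
```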


2014 ◽  
Vol 7 (14) ◽  
pp. 1857-1868 ◽  
Author(s):  
Wentao Wu ◽  
Xi Wu ◽  
Hakan Hacigümüş ◽  
Jeffrey F. Naughton

2020 ◽  
Vol 5 (2) ◽  
pp. 13-32
Author(s):  
Hye-Kyung Yang ◽  
Hwan-Seung Yong

Purpose: We propose InParTen2, a multi-aspect parallel factor analysis three-dimensional tensor decomposition algorithm based on the Apache Spark framework. The proposed method reduces re-decomposition cost and can handle large tensors.

Design/methodology/approach: Considering that tensor addition increases the size of a given tensor along all axes, the proposed method decomposes incoming tensors using existing decomposition results without generating sub-tensors. Additionally, InParTen2 avoids the calculation of Khatri–Rao products and minimizes shuffling by using the Apache Spark platform.

Findings: The performance of InParTen2 is evaluated by comparing its execution time and accuracy with those of existing distributed tensor decomposition methods on various datasets. The results confirm that InParTen2 can process large tensors and reduce the re-calculation cost of tensor decomposition. Consequently, the proposed method is faster than existing tensor decomposition algorithms and can significantly reduce re-decomposition cost.

Research limitations: There are several Hadoop-based distributed tensor decomposition algorithms as well as MATLAB-based decomposition methods. However, the former require longer iteration time, and therefore their execution time cannot be compared with that of Spark-based algorithms, whereas the latter run on a single machine, thus limiting their ability to handle large data.

Practical implications: The proposed algorithm can reduce re-decomposition cost when tensors are added to a given tensor by decomposing them based on existing decomposition results, without re-decomposing the entire tensor.

Originality/value: The proposed method can handle large tensors and is fast within the limited-memory framework of Apache Spark. Moreover, InParTen2 can handle static as well as incremental tensor decomposition.
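The incremental idea can be sketched as follows: when new slices arrive along the time mode, only the corresponding new rows of the time-mode factor are fitted against the existing factor matrices, rather than re-decomposing the whole tensor. For readability, this sketch materializes the Khatri–Rao product explicitly, whereas InParTen2 is specifically designed to avoid that computation and to run distributed on Spark.

```python
# Hedged sketch of incremental CP/PARAFAC extension along the time mode
# (single-machine NumPy, not the Spark implementation).
import numpy as np


def extend_time_factor(new_slices: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """new_slices: (t_new, I, J); A: (I, R); B: (J, R). Returns new time-factor rows (t_new, R)."""
    I, R = A.shape
    J = B.shape[0]
    # Column-wise Khatri-Rao product of A and B: design matrix for the time mode.
    kr = (A[:, None, :] * B[None, :, :]).reshape(I * J, R)
    # Unfold the incoming slices so each row is a vectorised (I x J) slice.
    X = new_slices.reshape(new_slices.shape[0], I * J)
    # Least-squares fit of the new time rows against the fixed factors.
    C_new, *_ = np.linalg.lstsq(kr, X.T, rcond=None)
    return C_new.T
```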


2020 ◽  
Vol 16 (11) ◽  
pp. e1008415
Author(s):  
Teresa Maria Rosaria Noviello ◽  
Francesco Ceccarelli ◽  
Michele Ceccarelli ◽  
Luigi Cerulo

Small non-coding RNAs (ncRNAs) are short non-coding sequences involved in gene regulation in many biological processes and diseases. The lack of a complete comprehension of their biological functionality, especially in a genome-wide scenario, has demanded new computational approaches to annotate their roles. It is widely known that secondary structure is a key determinant of RNA function, and machine learning based approaches have successfully been shown to predict RNA function from secondary structure information. Here we show that RNA function can be predicted with good accuracy from a lightweight representation of sequence information, without the need to compute secondary structure features, which is computationally expensive. This finding appears to go against the dogma of secondary structure being a key determinant of function in RNA. Compared to recent secondary structure based methods, the proposed solution is more robust to sequence boundary noise and drastically reduces the computational cost, allowing for large data volume annotations. Scripts and datasets to reproduce the results of the experiments proposed in this study are available at: https://github.com/bioinformatics-sannio/ncrna-deep.
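As a much-simplified illustration of the sequence-only idea (not the deep model released in the linked repository), the snippet below encodes each RNA sequence as k-mer counts and trains an off-the-shelf classifier, with no secondary-structure features at all.

```python
# Sketch: predict ncRNA class from a lightweight sequence representation
# (character 3-mer counts), without any secondary-structure computation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data; real labels would be ncRNA functional classes (e.g., miRNA, snoRNA).
seqs = ["AUGGCUACGUA", "GCGCGAUAUCG", "AUAUAUGCGCA", "GGCCGAUCCGA"]
labels = ["miRNA", "snoRNA", "miRNA", "snoRNA"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 3)),  # 3-mer count features
    LogisticRegression(max_iter=1000),
)
model.fit(seqs, labels)
print(model.predict(["AUGGCAACGUU"]))
```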


2020 ◽  
Vol 4 (4) ◽  
pp. 191
Author(s):  
Mohammad Aljanabi ◽  
Hind Ra'ad Ebraheem ◽  
Zahraa Faiz Hussain ◽  
Mohd Farhan Md Fudzee ◽  
Shahreen Kasim ◽  
...  

Much attention has been paid to large data technologies in the past few years, mainly due to their capability to impact business analytics and data mining practices, as well as the possibility of enabling a range of highly effective decision-making tools. With the current increase in the number of modern applications (including social media and other web-based and healthcare applications) that generate large volumes of data in different forms, processing such huge data volumes is becoming a challenge for conventional data processing tools. This has resulted in the emergence of big data analytics, which also comes with many challenges. This paper introduces the use of principal component analysis (PCA) for data size reduction, followed by SVM parallelization. The proposed scheme was executed on the Spark platform, and the experimental findings revealed its capability to reduce the classifier’s classification time without much influence on classification accuracy.
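A minimal sketch of the described pipeline, PCA for dimensionality reduction followed by a parallel linear SVM on Spark, is shown below using PySpark ML; the toy data and parameter values are placeholders rather than the paper's configuration.

```python
# Hedged sketch: PCA feature reduction feeding a linear SVM, run on Spark.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA
from pyspark.ml.classification import LinearSVC
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("pca_svm_sketch").getOrCreate()

# Toy two-class data; a real run would load the large dataset from HDFS/S3.
data = spark.createDataFrame(
    [(0.0, Vectors.dense([1.0, 0.2, 3.1, 0.0])),
     (1.0, Vectors.dense([0.1, 2.2, 0.3, 4.0])),
     (0.0, Vectors.dense([1.2, 0.1, 2.9, 0.1])),
     (1.0, Vectors.dense([0.0, 2.5, 0.2, 3.8]))],
    ["label", "features"],
)

pca = PCA(k=2, inputCol="features", outputCol="pca_features")     # reduce to 2 components
svm = LinearSVC(featuresCol="pca_features", labelCol="label", maxIter=50)
model = Pipeline(stages=[pca, svm]).fit(data)
model.transform(data).select("label", "prediction").show()
```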

