TinyLFU-Based Semi-Stream Cache Join for Near-Real-Time Data Warehousing

Mapping Intimacies ◽

10.21203/rs.3.rs-944044/v1 ◽

2022 ◽

Author(s):

M. Asif Naeem ◽

Wasiullah Waqar ◽

Farhaan Mirza ◽

Ali Tahir

Keyword(s):

Real Time ◽

Data Warehousing ◽

Cost Model ◽

Research Problem ◽

Daily Basis ◽

Stream Data ◽

Time Data ◽

Business Decisions ◽

Real Time Data ◽

Modern Era

Semi-stream join is an emerging research problem in the domain of near-real-time data warehousing. A semi-stream join is a join between a fast stream (S) and a slow disk-based relation (R). In the modern era, huge amounts of data are generated daily and need to be analyzed instantly to support successful business decisions. With this in mind, a well-known algorithm called CACHEJOIN (Cache Join) was proposed. The limitation of CACHEJOIN is that it does not deal efficiently with frequently changing trends in stream data. To overcome this limitation, in this paper we propose TinyLFU-CACHEJOIN, a modified version of the original CACHEJOIN algorithm designed to improve its performance. TinyLFU-CACHEJOIN employs an intelligent strategy that keeps in the cache only those records of R that have a high hit rate in S. This mechanism allows TinyLFU-CACHEJOIN to deal with sudden and abrupt trend changes in S. We developed a cost model for TinyLFU-CACHEJOIN and validated it empirically. We also compared the performance of TinyLFU-CACHEJOIN with that of the existing CACHEJOIN algorithm on a skewed synthetic dataset. The experiments showed that TinyLFU-CACHEJOIN significantly outperforms CACHEJOIN.
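The admission idea described in this abstract (keep only records of R that are hit often in S) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no design details, so the class names `FrequencySketch` and `AdmissionCache`, the count-min layout, the `sample_size` aging threshold, and the "admit only if more frequent than the victim" rule are all assumptions modelled on the general TinyLFU approach.

```python
import hashlib


class FrequencySketch:
    """Small count-min sketch approximating how often each key appeared in S."""

    def __init__(self, width=1024, depth=4, sample_size=10_000):
        self.width = width
        self.depth = depth
        self.sample_size = sample_size  # assumed aging threshold
        self.seen = 0
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        # One counter per row, chosen by a row-salted hash of the key.
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def record(self, key):
        for row, col in self._slots(key):
            self.rows[row][col] += 1
        self.seen += 1
        if self.seen >= self.sample_size:
            # Aging: halve every counter so that old trends fade out.
            self.rows = [[count // 2 for count in row] for row in self.rows]
            self.seen //= 2

    def estimate(self, key):
        return min(self.rows[row][col] for row, col in self._slots(key))


class AdmissionCache:
    """Cache of R records; a newcomer replaces the least frequent resident
    only if the newcomer's estimated frequency in S is higher."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.sketch = FrequencySketch()
        self.records = {}

    def lookup(self, key):
        # Every stream tuple updates the frequency estimate for its key.
        self.sketch.record(key)
        return self.records.get(key)

    def offer(self, key, record):
        if key in self.records or len(self.records) < self.capacity:
            self.records[key] = record
            return True
        victim = min(self.records, key=self.sketch.estimate)
        if self.sketch.estimate(key) > self.sketch.estimate(victim):
            del self.records[victim]
            self.records[key] = record
            return True
        return False
```

In this sketch a join operator would call `lookup` for each stream tuple and, on a miss, call `offer` with the record fetched from R. A key that has only just started to appear is rejected until its estimated frequency overtakes that of the weakest cached record, and the periodic halving lets formerly popular keys lose their place when the trend in S changes.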


Parallelisation of a Cache-Based Stream-Relation Join for a Near-Real-Time Data Warehouse

Electronics ◽

10.3390/electronics9081299 ◽

2020 ◽

Vol 9 (8) ◽

pp. 1299

Author(s):

M. Asif Naeem ◽

Habib Khan ◽

Saad Aslam ◽

Noreen Jamil

Keyword(s):

Real Time ◽

Data Warehouse ◽

Cost Model ◽

Parallel Execution ◽

Stream Data ◽

Time Data ◽

Sales Data ◽

Master Data ◽

Real Time Data ◽

Two Phases

Near-real-time data warehousing is an important area of research, as business organisations want to analyse their business sales with minimal latency. Therefore, sales data generated by data sources needs to be reflected immediately in the data warehouse. This requires near-real-time transformation of the stream of sales data with a disk-based relation called master data in the staging area. For this purpose, a stream-relation join is required. The main problem in stream-relation joins is the different nature of the inputs: stream data is fast and bursty, whereas the disk-based relation is slow due to high disk I/O cost. To resolve this problem, a well-known algorithm, CACHEJOIN (cache join), was published in the literature. The algorithm has two phases, the disk-probing phase and the stream-probing phase. These two phases execute sequentially, which means stream tuples wait unnecessarily. This prevents the algorithm from exploiting CPU resources optimally. In this paper, we address this issue by presenting a robust algorithm called PCSRJ (parallelised cache-based stream relation join). The new algorithm enables the execution of both the disk-probing and stream-probing phases of CACHEJOIN in parallel. The algorithm distributes the disk-based relation over two separate nodes and enables parallel execution of CACHEJOIN on each node. The algorithm also implements a strategy of splitting the stream data across the nodes depending on the relevant part of the relation. We developed a cost model for PCSRJ and validated it empirically. We compared the service rates of both algorithms using a synthetic dataset. Our experiments showed that PCSRJ significantly outperforms CACHEJOIN.
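The distribution strategy in this abstract (R split over two nodes, each stream tuple sent to the node holding the relevant part) can be sketched in Python as follows. The abstract does not say how the relation is divided, so the range split on a `split_key` and the function names `partition_relation`, `route_stream` and `join_on_node` are assumptions made for illustration.

```python
from collections import defaultdict


def partition_relation(relation, split_key):
    """Split master data R into two parts: keys below split_key, and the rest."""
    low = {key: row for key, row in relation.items() if key < split_key}
    high = {key: row for key, row in relation.items() if key >= split_key}
    return [low, high]


def route_stream(stream, split_key):
    """Send each stream tuple to the node whose part of R can contain its key."""
    batches = defaultdict(list)
    for key, payload in stream:
        node = 0 if key < split_key else 1
        batches[node].append((key, payload))
    return batches


def join_on_node(part, batch):
    """Join one node's stream batch with its local part of R."""
    return [(key, payload, part[key]) for key, payload in batch if key in part]
```

Because each node only ever sees keys from its own part of R, the two `join_on_node` calls share no state and could run on separate machines or worker processes.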


Online Processing of End-User Data in Real-Time Data Warehousing

Improving Knowledge Discovery through the Integration of Data Mining Techniques - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-8513-0.ch002 ◽

2015 ◽

pp. 13-31

Author(s):

M. Asif Naeem ◽

Noreen Jamil

Keyword(s):

Real Time ◽

Cost Model ◽

Main Memory ◽

Stream Data ◽

Time Data ◽

Online Processing ◽

Master Data ◽

Real Time Data ◽

User Data ◽

Resource Aware

Stream-based join algorithms are a promising technology for modern real-time data warehouses. A particular category of stream-based joins is the semi-stream join, where a single stream is joined with disk-based master data. The join operator typically works under limited main memory, and this memory is generally not large enough to hold the whole disk-based master data. Recently, a seminal join algorithm called MESHJOIN (Mesh Join) has been proposed in the literature to process semi-stream data. MESHJOIN is a candidate for a resource-aware system setup. However, MESHJOIN is not very selective. In particular, MESHJOIN does not consider the characteristics of stream data, and its performance is suboptimal for skewed stream data. This chapter presents a novel Cached-based Semi-Stream Join (CSSJ) using a cache module. The algorithm is more appropriate for skewed distributions, and we present results for Zipfian distributions of the type that appear in many applications. We conduct a rigorous experimental study to test our algorithm. Our experiments show that CSSJ outperforms MESHJOIN significantly. We also present the cost model for our CSSJ and validate it with experiments.
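Why a cache module suits skewed streams can be seen from a small calculation. The Python sketch below assumes an idealised Zipfian stream in which the key of rank r occurs with weight 1/r^exponent and the cache holds exactly the most frequent keys. The function name `zipf_hit_rate` and this idealisation are assumptions, not taken from the chapter.

```python
def zipf_hit_rate(num_keys, cache_size, exponent=1.0):
    """Fraction of stream tuples whose key is among the cache_size most frequent keys."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    return sum(weights[:cache_size]) / sum(weights)


# Compare a skewed stream with a uniform one (exponent 0) for the same cache size.
skewed = zipf_hit_rate(num_keys=1_000_000, cache_size=10_000, exponent=1.0)
uniform = zipf_hit_rate(num_keys=1_000_000, cache_size=10_000, exponent=0.0)
print(f"skewed: {skewed:.2f}, uniform: {uniform:.2f}")
```

With a uniform stream the cache serves only the fraction `cache_size / num_keys` of tuples, whereas under skew the same cache serves a much larger share without touching the disk.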


Optimizing Semi-Stream CACHEJOIN for Near-Real-Time Data Warehousing

Journal of Database Management ◽

10.4018/jdm.2020010102 ◽

2020 ◽

Vol 31 (1) ◽

pp. 20-37 ◽

Cited By ~ 1

Author(s):

M. Asif Naeem ◽

Erum Mehmood ◽

M. G. Abbas Malik ◽

Noreen Jamil

Keyword(s):

Real Time ◽

Data Warehousing ◽

Streaming Data ◽

Service Rate ◽

Stream Data ◽

Time Data ◽

Join Algorithm ◽

Real Time Data ◽

Two Phases ◽

Critical Process

Streaming data join is a critical process in the field of near-real-time data warehousing. For this purpose, an adaptive semi-stream join algorithm called CACHEJOIN (Cache Join), which focuses on non-uniform stream data, is provided in the literature. However, this algorithm cannot exploit memory and CPU resources optimally, and its service rate is consequently suboptimal, because its two phases, the stream-probing (SP) phase and the disk-probing (DP) phase, execute sequentially. Building on the advantages of CACHEJOIN, this article presents two modifications of it. The first, called P-CACHEJOIN (Parallel Cache Join), enables parallel processing of the two phases of CACHEJOIN. This increases the number of joined stream records and therefore improves throughput considerably. The second, called OP-CACHEJOIN (Optimized Parallel Cache Join), implements parallel loading of stored data into memory while the DP phase is executing. This research presents an empirical performance analysis of both approaches against the existing CACHEJOIN, using a synthetic skewed dataset.
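The idea of running the stream-probing and disk-probing phases side by side can be sketched with two Python threads connected by a queue. This is a simplified illustration and not the published P-CACHEJOIN: the `cache` and `disk_relation` arguments are plain dictionaries standing in for the in-memory cache and the disk-based relation, and the function name `parallel_cache_join` is hypothetical.

```python
import queue
import threading


def parallel_cache_join(stream, cache, disk_relation):
    """Join stream tuples with R using two concurrently running phases."""
    misses = queue.Queue()
    sp_output, dp_output = [], []
    done = object()  # sentinel telling the disk-probing phase to stop

    def stream_probing():
        # SP phase: tuples that match the cache are joined immediately;
        # the rest are handed over to the DP phase.
        for key, payload in stream:
            if key in cache:
                sp_output.append((key, payload, cache[key]))
            else:
                misses.put((key, payload))
        misses.put(done)

    def disk_probing():
        # DP phase: resolve cache misses against the (simulated) disk relation.
        while True:
            item = misses.get()
            if item is done:
                break
            key, payload = item
            if key in disk_relation:
                dp_output.append((key, payload, disk_relation[key]))

    workers = [
        threading.Thread(target=stream_probing),
        threading.Thread(target=disk_probing),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return sp_output, dp_output
```

Each output list is written by a single thread, so no extra locking is needed, and the stream-probing phase never waits for a disk lookup to finish.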


Development of a New KPI for the Economic Quantification of Six Big Losses and Its Implementation in a Cyber Physical System

Applied Sciences ◽

10.3390/app10249154 ◽

2020 ◽

Vol 10 (24) ◽

pp. 9154

Author(s):

Paula Morella ◽

María Pilar Lambán ◽

Jesús Royo ◽

Juan Carlos Sánchez ◽

Jaime Latapia

Keyword(s):

Real Time ◽

Physical System ◽

Performance Indicator ◽

Cost Model ◽

Relevant Information ◽

Cyber Physical System ◽

Time Data ◽

Real Time Data ◽

Different Dimensions ◽

The Cost

The purpose of this work is to develop a new Key Performance Indicator (KPI) that can quantify the cost of the Six Big Losses defined by Nakajima and to implement it in a Cyber Physical System (CPS), achieving real-time monitoring of the KPI. This paper follows the methodology explained below. A cost model has been used, together with the description of the Six Big Losses, to develop this indicator accurately. At the same time, the machine tool has been integrated into a CPS, enhancing real-time data acquisition by using Industry 4.0 technologies. Once the KPI was defined, we developed software (using Python) that turns these real-time data into relevant information through the calculation of our indicator. Finally, we carried out a case study showing the results of our new KPI and comparing them with other indicators related to the Six Big Losses but in different dimensions. As a result, our research quantifies the Six Big Losses economically, improves the detection of the largest losses so that they can be reduced, and highlights the importance of paying attention to different dimensions at the same time, mainly the productive, sustainable, and economic ones.


Issues and Handy Solutions Addressed at Every Stage in Real Time Data Warehousing, I.E. ETL (Extraction, Transformation & Loading)

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e1100.0785s319 ◽

2019 ◽

Vol 8 (5S3) ◽

pp. 344-348

Keyword(s):

Real Time ◽

Data Warehouse ◽

Data Warehousing ◽

Time Data ◽

Processing Load ◽

Real Time Data ◽

Data Source

In standard ETL (Extraction, Transformation and Loading), the data warehouse refresh must be performed outside peak hours. This implies that operations and analysis are stopped during the refresh. As a result, the data in the data warehouse is stale and does not reflect the latest operational transactions. This issue is known as data latency. Near-real-time data warehousing is employed as a remedy for this issue. It updates the data warehouse in a near-real-time fashion, immediately after data appears at the data source. Therefore, data latency can be reduced. However, near-real-time data warehousing has issues that were not identified in traditional ETL. This paper aims to communicate the issues and the available solutions at every stage of near-real-time data warehousing, i.e. ETL. The issues and available alternatives are based on a literature review of additional studies that focus on near-real-time data warehousing issues.


Tuned X-HYBRIDJOIN for Near-Real-Time Data Warehousing

Web Technologies and Applications - Lecture Notes in Computer Science ◽

10.1007/978-3-642-37401-2_49 ◽

2013 ◽

pp. 494-505 ◽

Cited By ~ 2

Author(s):

M. Asif Naeem

Keyword(s):

Real Time ◽

Data Warehousing ◽

Time Data ◽

Real Time Data


Big Data Management in the Context of Real-Time Data Warehousing

Big Data Management, Technologies, and Applications - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-4699-5.ch007 ◽

2013 ◽

pp. 150-176

Author(s):

M. Asif Naeem ◽

Gillian Dobbie ◽

Gerald Weber

Keyword(s):

Big Data ◽

Data Integration ◽

Real Time ◽

Real Life ◽

Skewed Distribution ◽

Stream Data ◽

Time Data ◽

Master Data ◽

Real Time Data ◽

Resource Aware

In order to make timely and effective decisions, businesses need the latest information from big data warehouse repositories. To keep these repositories up to date, real-time data integration is required. An important phase in real-time data integration is data transformation where a stream of updates, which is huge in volume and infinite, is joined with large disk-based master data. Stream processing is an important concept in Big Data, since large volumes of data are often best processed immediately. A well-known algorithm called Mesh Join (MESHJOIN) was proposed to process stream data with disk-based master data, which uses limited memory. MESHJOIN is a candidate for a resource-aware system setup. The problem that the authors consider in this chapter is that MESHJOIN is not very selective. In particular, the performance of the algorithm is always inversely proportional to the size of the master data table. As a consequence, the resource consumption is in some scenarios suboptimal. They present an algorithm called Cache Join (CACHEJOIN), which performs asymptotically at least as well as MESHJOIN but performs better in realistic scenarios, particularly if parts of the master data are used with different frequencies. In order to quantify the performance differences, the authors compare both algorithms with a synthetic dataset of a known skewed distribution as well as TPC-H and real-life datasets.


Bioterrorism Surveillance with Real-Time Data Warehousing

Intelligence and Security Informatics - Lecture Notes in Computer Science ◽

10.1007/3-540-44853-5_24 ◽

2003 ◽

pp. 322-335 ◽

Cited By ~ 12

Author(s):

Donald J. Berndt ◽

Alan R. Hevner ◽

James Studnicki

Keyword(s):

Real Time ◽

Data Warehousing ◽

Time Data ◽

Real Time Data


Real-Time Data Warehousing: A Rewrite/Merge Approach

Data Warehousing and Knowledge Discovery - Lecture Notes in Computer Science ◽

10.1007/978-3-319-10160-6_8 ◽

2014 ◽

pp. 78-88 ◽

Cited By ~ 2

Author(s):

Alfredo Cuzzocrea ◽

Nickerson Ferreira ◽

Pedro Furtado

Keyword(s):

Real Time ◽

Data Warehousing ◽

Time Data ◽

Real Time Data


Handling of internal inconsistency OLAP - Based lock table using Message Oriented Middleware in near real time data warehousing

2015 International Seminar on Intelligent Technology and Its Applications (ISITIA) ◽

10.1109/isitia.2015.7220001 ◽

2015 ◽

Author(s):

Ardianto Wibowo ◽

Saiful Akbar

Keyword(s):

Real Time ◽

Data Warehousing ◽

Time Data ◽

Message Oriented Middleware ◽

Real Time Data ◽

Internal Inconsistency
