A Cached-Based Stream-Relation Join Operator for Semi-Stream Data Processing

Stream-based join algorithms have gained a prominent role in the field of real-time data warehouses. One particular type of stream-based join is the semi-stream join, where a single stream is joined with a disk-based relation. Normally this disk-based relation is too large to fit into the memory available to the join operator. Therefore, the relation is loaded into memory in partitions. A well-known join algorithm called MESHJOIN (Mesh Join) has been presented in the literature to process semi-stream data. However, the algorithm has some limitations. In particular, MESHJOIN does not consider the characteristics of stream data and therefore does not perform well for skewed stream data. This article introduces the concept of caching and, based on that, presents a novel algorithm called Cached-based Stream-Relation Join (CSRJ). The algorithm exploits skewed distributions more appropriately, and the authors present results for Zipfian distributions of the type that appear in many applications. They test their algorithm using synthetic, TPC-H and real datasets. Their experiments show that CSRJ performs significantly better than MESHJOIN. They also derive the cost model for their algorithm and, based on that, tune the algorithm.
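
The caching idea in this abstract can be illustrated with a minimal sketch. It is not the authors' CSRJ: the function name, the `cache_size` parameter, the exact hit counter, and the use of an in-memory dict as a stand-in for the disk-based relation are all assumptions made for illustration, and the relation is probed per key rather than loaded in partitions. The point it shows is that under a skewed stream a small cache of frequent keys absorbs most lookups.

```python
from collections import Counter

def cached_semi_stream_join(stream, relation, cache_size=2):
    """Join (key, payload) stream tuples with `relation`.

    `relation` is a dict standing in for the disk-based relation
    (an assumption; a real operator would read partitions from disk).
    Returns the joined tuples and the number of simulated disk probes.
    """
    cache = {}          # frequently matched relation rows kept in memory
    hits = Counter()    # how often each key has appeared in the stream
    output = []
    disk_reads = 0
    for key, payload in stream:
        hits[key] += 1
        if key in cache:
            # Cache hit: the tuple is joined without touching the disk.
            output.append((key, payload, cache[key]))
            continue
        row = relation.get(key)   # stand-in for a disk probe
        disk_reads += 1
        if row is None:
            continue              # no matching row in the relation
        output.append((key, payload, row))
        if len(cache) < cache_size:
            cache[key] = row
        else:
            # Replace the least frequent cached key only if the new
            # key has been seen more often in the stream.
            coldest = min(cache, key=lambda k: hits[k])
            if hits[key] > hits[coldest]:
                del cache[coldest]
                cache[key] = row
    return output, disk_reads
```

With a stream dominated by one key, only the first occurrence of that key and the rare keys cause a disk probe; every later occurrence is served from the cache.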

Development of a New KPI for the Economic Quantification of Six Big Losses and Its Implementation in a Cyber Physical System

Applied Sciences ◽

10.3390/app10249154 ◽

2020 ◽

Vol 10 (24) ◽

pp. 9154

Author(s):

Paula Morella ◽

María Pilar Lambán ◽

Jesús Royo ◽

Juan Carlos Sánchez ◽

Jaime Latapia

Keyword(s):

Real Time ◽

Physical System ◽

Performance Indicator ◽

Cost Model ◽

Relevant Information ◽

Cyber Physical System ◽

Time Data ◽

Real Time Data ◽

Different Dimensions ◽

The Cost

The purpose of this work is to develop a new Key Performance Indicator (KPI) that can quantify the cost of the Six Big Losses defined by Nakajima and to implement it in a Cyber Physical System (CPS), achieving real-time monitoring of the KPI. The paper follows the methodology explained below. A cost model has been used, together with the description of the Six Big Losses, to develop this indicator accurately. At the same time, the machine tool has been integrated into a CPS, enhancing real-time data acquisition by means of Industry 4.0 technologies. Once the KPI was defined, we developed software (in Python) that turns these real-time data into relevant information by calculating our indicator. Finally, we carried out a case study showing the results of our new KPI and comparing them with other indicators that relate to the Six Big Losses but in different dimensions. As a result, our research quantifies the Six Big Losses economically, makes it easier to detect the largest losses so that they can be reduced, and highlights the importance of paying attention to different dimensions at the same time, mainly the productive, the sustainable, and the economic.

Parallelisation of a Cache-Based Stream-Relation Join for a Near-Real-Time Data Warehouse

Electronics ◽

10.3390/electronics9081299 ◽

2020 ◽

Vol 9 (8) ◽

pp. 1299

Author(s):

M. Asif Naeem ◽

Habib Khan ◽

Saad Aslam ◽

Noreen Jamil

Keyword(s):

Real Time ◽

Data Warehouse ◽

Cost Model ◽

Parallel Execution ◽

Stream Data ◽

Time Data ◽

Sales Data ◽

Master Data ◽

Real Time Data ◽

Two Phases

Near-real-time data warehousing is an important area of research, as business organisations want to analyse their business sales with minimal latency. Therefore, sales data generated by data sources need to be reflected immediately in the data warehouse. This requires near-real-time transformation of the stream of sales data with a disk-based relation called master data in the staging area. For this purpose, a stream-relation join is required. The main problem in stream-relation joins is the different nature of the inputs: stream data is fast and bursty, whereas the disk-based relation is slow due to high disk I/O cost. To resolve this problem, a well-known algorithm called CACHEJOIN (cache join) was published in the literature. The algorithm has two phases, the disk-probing phase and the stream-probing phase. These two phases execute sequentially, which means that stream tuples wait unnecessarily. This prevents the algorithm from exploiting CPU resources optimally. In this paper, we address this issue by presenting a robust algorithm called PCSRJ (parallelised cache-based stream relation join). The new algorithm enables the execution of both the disk-probing and stream-probing phases of CACHEJOIN in parallel. The algorithm distributes the disk-based relation across two separate nodes and enables parallel execution of CACHEJOIN on each node. The algorithm also implements a strategy of splitting the stream data between the nodes depending on the relevant part of the relation. We developed a cost model for PCSRJ and validated it empirically. We compared the service rates of both algorithms using a synthetic dataset. Our experiments showed that PCSRJ significantly outperforms CACHEJOIN.
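
The stream-splitting strategy described in this abstract can be sketched roughly as follows. This is not the PCSRJ implementation: the range split at a `pivot` key, the function names, the use of two threads in place of two nodes, and the plain hash lookup in place of a full CACHEJOIN per node are all assumptions chosen to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor

def split_relation(relation, pivot):
    """Divide the relation into two parts by key range (assumed scheme)."""
    low = {k: v for k, v in relation.items() if k < pivot}
    high = {k: v for k, v in relation.items() if k >= pivot}
    return low, high

def route_stream(stream, pivot):
    """Send each stream tuple to the part that could hold its key."""
    low, high = [], []
    for key, payload in stream:
        (low if key < pivot else high).append((key, payload))
    return low, high

def join_on_node(sub_stream, sub_relation):
    """Per-node join; a simple lookup stands in for CACHEJOIN here."""
    return [(k, p, sub_relation[k]) for k, p in sub_stream if k in sub_relation]

def parallel_join(stream, relation, pivot):
    rel_parts = split_relation(relation, pivot)
    stream_parts = route_stream(stream, pivot)
    # Two workers play the role of the two nodes.
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(join_on_node, stream_parts, rel_parts))
    return results[0] + results[1]
```

Because each stream tuple is routed to the only part of the relation that can contain its key, the two joins never need to exchange data, which is what makes running them side by side straightforward in this sketch.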

Online Processing of End-User Data in Real-Time Data Warehousing

Improving Knowledge Discovery through the Integration of Data Mining Techniques - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-8513-0.ch002 ◽

2015 ◽

pp. 13-31

Author(s):

M. Asif Naeem ◽

Noreen Jamil

Keyword(s):

Real Time ◽

Cost Model ◽

Main Memory ◽

Stream Data ◽

Time Data ◽

Online Processing ◽

Master Data ◽

Real Time Data ◽

User Data ◽

Resource Aware

Stream-based join algorithms are a promising technology for modern real-time data warehouses. A particular category of stream-based joins is the semi-stream join, where a single stream is joined with disk-based master data. The join operator typically works under limited main memory, and this memory is generally not large enough to hold the whole of the disk-based master data. Recently, a seminal join algorithm called MESHJOIN (Mesh Join) has been proposed in the literature to process semi-stream data. MESHJOIN is a candidate for a resource-aware system setup. However, MESHJOIN is not very selective. In particular, MESHJOIN does not consider the characteristics of stream data, and its performance is suboptimal for skewed stream data. This chapter presents a novel Cached-based Semi-Stream Join (CSSJ) using a cache module. The algorithm is more appropriate for skewed distributions, and we present results for Zipfian distributions of the type that appear in many applications. We conduct a rigorous experimental study to test our algorithm. Our experiments show that CSSJ outperforms MESHJOIN significantly. We also present the cost model for our CSSJ and validate it with experiments.

TinyLFU-Based Semi-Stream Cache Join for Near-Real-Time Data Warehousing

10.21203/rs.3.rs-944044/v1 ◽

2022 ◽

Author(s):

M. Asif Naeem ◽

Wasiullah Waqar ◽

Farhaan Mirza ◽

Ali Tahir

Keyword(s):

Real Time ◽

Data Warehousing ◽

Cost Model ◽

Research Problem ◽

Daily Basis ◽

Stream Data ◽

Time Data ◽

Business Decisions ◽

Real Time Data ◽

Modern Era

Semi-stream join is an emerging research problem in the domain of near-real-time data warehousing. A semi-stream join is a join between a fast stream (S) and a slow disk-based relation (R). In the modern era of technology, huge amounts of data are generated on a daily basis and need to be analyzed instantly to support successful business decisions. With this in mind, a well-known algorithm called CACHEJOIN (Cache Join) was proposed. The limitation of the CACHEJOIN algorithm is that it does not deal efficiently with frequently changing trends in stream data. To overcome this limitation, in this paper we propose TinyLFU-CACHEJOIN, a modified version of the original CACHEJOIN algorithm designed to enhance its performance. TinyLFU-CACHEJOIN employs an intelligent strategy which keeps in the cache only those records of R that have a high hit rate in S. This mechanism allows TinyLFU-CACHEJOIN to deal with sudden and abrupt trend changes in S. We developed a cost model for our TinyLFU-CACHEJOIN algorithm and validated it empirically. We also compared the performance of our proposed TinyLFU-CACHEJOIN algorithm with the existing CACHEJOIN algorithm on a skewed synthetic dataset. The experiments showed that the TinyLFU-CACHEJOIN algorithm significantly outperforms the CACHEJOIN algorithm.
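
A frequency-based admission policy of the kind this abstract describes can be sketched as below. This is a simplified illustration, not the paper's algorithm: the class and method names, the `capacity` and `window` parameters, and the exact `Counter` (used here in place of the compact approximate frequency sketch that TinyLFU normally relies on) are all assumptions. The periodic halving of counts is what lets old trends fade so that newly popular keys can displace them.

```python
from collections import Counter

class FrequencyAdmissionCache:
    """Cache that admits a row only if its key is more frequent than the victim's."""

    def __init__(self, capacity, window):
        self.capacity = capacity   # maximum number of cached rows
        self.window = window       # stream tuples between ageing steps
        self.freq = Counter()      # exact counts; an assumed simplification
        self.seen = 0
        self.rows = {}

    def record(self, key):
        """Count one stream occurrence of `key` and age counts periodically."""
        self.freq[key] += 1
        self.seen += 1
        if self.seen >= self.window:
            # Halve every count so that stale popularity decays.
            for k in list(self.freq):
                self.freq[k] //= 2
                if self.freq[k] == 0:
                    del self.freq[k]
            self.seen = 0

    def get(self, key):
        return self.rows.get(key)

    def offer(self, key, row):
        """Try to cache `row`; return True if it is (or already was) cached."""
        if key in self.rows:
            return True
        if len(self.rows) < self.capacity:
            self.rows[key] = row
            return True
        victim = min(self.rows, key=lambda k: self.freq[k])
        if self.freq[key] > self.freq[victim]:
            del self.rows[victim]
            self.rows[key] = row
            return True
        return False
```

In a join operator built on this sketch, `record` would be called for every incoming stream tuple and `offer` whenever a row has just been fetched from the relation.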

Optimizing Semi-Stream CACHEJOIN for Near-Real-Time Data Warehousing

Journal of Database Management ◽

10.4018/jdm.2020010102 ◽

2020 ◽

Vol 31 (1) ◽

pp. 20-37 ◽

Cited By ~ 1

Author(s):

M. Asif Naeem ◽

Erum Mehmood ◽

M. G. Abbas Malik ◽

Noreen Jamil

Keyword(s):

Real Time ◽

Data Warehousing ◽

Streaming Data ◽

Service Rate ◽

Stream Data ◽

Time Data ◽

Join Algorithm ◽

Real Time Data ◽

Two Phases ◽

Critical Process

Streaming data join is a critical process in the field of near-real-time data warehousing. For this purpose, an adaptive semi-stream join algorithm called CACHEJOIN (Cache Join), which focuses on non-uniform stream data, is available in the literature. However, this algorithm cannot exploit memory and CPU resources optimally, and its service rate is consequently suboptimal, because its two phases, the stream-probing (SP) phase and the disk-probing (DP) phase, execute sequentially. Building on the advantages of CACHEJOIN, this article presents two modifications of it. The first, called P-CACHEJOIN (Parallel Cache Join), enables parallel processing of the two phases of CACHEJOIN. This increases the number of joined stream records and therefore improves throughput considerably. The second, called OP-CACHEJOIN (Optimized Parallel Cache Join), loads stored data into memory in parallel while the DP phase is executing. This research empirically analyses the performance of both approaches against the existing CACHEJOIN using a synthetic skewed dataset.

Executing Complex Calculations in the Cloud to Enable Real-Time Data Processing

2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech) ◽

10.1109/dasc-picom-cbdcom-cyberscitech49142.2020.00118 ◽

2020 ◽

Author(s):

Manuel Leibetseder ◽

Marc Kurz ◽

Erik Sonnleitner ◽

Thomas Rittenschober

Keyword(s):

Data Processing ◽

Real Time ◽

Time Data ◽

Real Time Data ◽

Real Time Data Processing

Configuration Pump Performance Test System Design Based on Kingview

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.121-126.3195 ◽

2011 ◽

Vol 121-126 ◽

pp. 3195-3199

Author(s):

Li Feng Yang ◽

Jun Yuan ◽

Wei Na Liu ◽

Xiu Ming Nie ◽

Xue Liang Pei

Keyword(s):

Experimental Data ◽

Data Processing ◽

System Design ◽

Centrifugal Pump ◽

Performance Test ◽

Test System ◽

Performance Parameters ◽

Time Data ◽

Pump Performance ◽

Real Time Data

Kingview is used to acquire and display the performance parameters of the centrifugal pump in real time. The collected experimental data are stored in an Access database, and the database-read and drawing functions of VB are used to process the data and plot the relationship curves of the performance parameters.
