A Cached-Based Stream-Relation Join Operator for Semi-Stream Data Processing

2016 ◽  
Vol 12 (3) ◽  
pp. 14-31 ◽  
Author(s):  
M. Asif Naeem ◽  
Imran Sarwar Bajwa ◽  
Noreen Jamil

Stream-based join algorithms have gained a prominent role in the field of real-time data warehouses. One particular type of stream-based join is the semi-stream join, where a single stream is joined with a disk-based relation. Normally this disk-based relation is too large to fit into the memory available to the join operator. Therefore, the relation is loaded into memory in partitions. A well-known join algorithm called MESHJOIN (Mesh Join) has been presented in the literature to process semi-stream data. However, the algorithm has some limitations. In particular, MESHJOIN does not consider the characteristics of stream data and therefore does not perform well for skewed stream data. This article introduces the concept of caching and, based on that, presents a novel algorithm called Cached-based Stream-Relation Join (CSRJ). The algorithm exploits skewed distributions more appropriately, and the authors present results for Zipfian distributions of the type that appear in many applications. They test their algorithm using synthetic, TPC-H and real datasets. Their experiments show that CSRJ performs significantly better than MESHJOIN. They also derive the cost model for their algorithm and tune the algorithm based on it.
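As a rough illustration of the idea in this abstract, the following Python sketch joins a batch of stream tuples against a relation that is read in partitions, while a small in-memory cache answers the most frequent keys directly. It is a simplified sketch and not the authors' CSRJ implementation: the function name `cached_semi_stream_join`, the parameters `cache_size` and `partition_size`, and the choice to fill the cache from the key frequencies of the batch are all assumptions made for the example.

```python
from collections import Counter


def cached_semi_stream_join(stream, relation, cache_size=2, partition_size=2):
    """Join stream tuples (key, payload) with relation {key: value}.

    Hypothetical sketch: the dict `relation` stands in for disk-based data
    and is scanned in fixed-size partitions; a small cache holds hot keys.
    """
    # Assumption: the hottest keys are taken from the batch's key frequencies.
    freq = Counter(key for key, _ in stream)
    hot_keys = [k for k, _ in freq.most_common(cache_size) if k in relation]
    cache = {k: relation[k] for k in hot_keys}

    results, pending = [], []
    cache_hits = 0
    # Cache phase: frequent keys are joined without touching the "disk".
    for key, payload in stream:
        if key in cache:
            results.append((key, payload, cache[key]))
            cache_hits += 1
        else:
            pending.append((key, payload))

    # Disk phase: load the relation partition by partition and probe each
    # partition with every stream tuple that missed the cache.
    keys = sorted(relation)
    for start in range(0, len(keys), partition_size):
        partition = {k: relation[k] for k in keys[start:start + partition_size]}
        for key, payload in pending:
            if key in partition:
                results.append((key, payload, partition[key]))

    return results, cache_hits
```

With a skewed batch, most tuples are served by the cache phase and only the rare keys wait for the partition scan, which is the effect the abstract attributes to caching.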

2020 ◽  
Vol 10 (24) ◽  
pp. 9154
Author(s):  
Paula Morella ◽  
María Pilar Lambán ◽  
Jesús Royo ◽  
Juan Carlos Sánchez ◽  
Jaime Latapia

The purpose of this work is to develop a new Key Performance Indicator (KPI) that quantifies the cost of the Six Big Losses defined by Nakajima, and to implement it in a Cyber Physical System (CPS), achieving real-time monitoring of the KPI. The paper follows the methodology explained below. A cost model has been used, together with the description of the Six Big Losses, to develop this indicator accurately. At the same time, the machine tool has been integrated into a CPS, enhancing real-time data acquisition using Industry 4.0 technologies. Once the KPI was defined, we developed software (in Python) that turns these real-time data into relevant information by calculating our indicator. Finally, we carried out a case study showing the results of our new KPI and comparing them with other indicators that relate to the Six Big Losses but in different dimensions. As a result, our research quantifies the Six Big Losses economically, makes the largest losses easier to detect so that they can be reduced, and highlights the importance of paying attention to the productive, sustainable, and economic dimensions at the same time.


Electronics ◽  
2020 ◽  
Vol 9 (8) ◽  
pp. 1299
Author(s):  
M. Asif Naeem ◽  
Habib Khan ◽  
Saad Aslam ◽  
Noreen Jamil

Near-real-time data warehousing is an important area of research, as business organisations want to analyse their sales with minimal latency. Therefore, sales data generated by data sources need to be reflected immediately in the data warehouse. This requires near-real-time transformation of the stream of sales data with a disk-based relation called master data in the staging area. For this purpose, a stream-relation join is required. The main problem in stream-relation joins is the different nature of the inputs: stream data is fast and bursty, whereas the disk-based relation is slow due to high disk I/O cost. To resolve this problem, a well-known algorithm called CACHEJOIN (cache join) was published in the literature. The algorithm has two phases, the disk-probing phase and the stream-probing phase. These two phases execute sequentially, which means that stream tuples wait unnecessarily. This prevents the algorithm from exploiting CPU resources optimally. In this paper, we address this issue by presenting a robust algorithm called PCSRJ (parallelised cache-based stream relation join). The new algorithm enables the execution of both the disk-probing and stream-probing phases of CACHEJOIN in parallel. The algorithm distributes the disk-based relation over two separate nodes and enables parallel execution of CACHEJOIN on each node. The algorithm also implements a strategy of splitting the stream data between the nodes depending on the relevant part of the relation. We developed a cost model for PCSRJ and validated it empirically. We compared the service rates of both algorithms using a synthetic dataset. Our experiments showed that PCSRJ significantly outperforms CACHEJOIN.
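The stream-splitting strategy described in this abstract can be pictured with a small Python sketch: the relation is divided into two parts, each stream tuple is routed to the part that holds its key, and the two parts are joined concurrently. This is only an illustration under stated assumptions, not the PCSRJ algorithm itself. The helper names (`split_relation`, `route_stream`, `join_on_node`, `parallel_join`) are hypothetical, two worker threads stand in for the two nodes, and a plain dictionary lookup stands in for the CACHEJOIN instance that the abstract says runs on each node.

```python
from concurrent.futures import ThreadPoolExecutor


def split_relation(relation):
    """Split a relation {key: value} into two halves by sorted key."""
    keys = sorted(relation)
    mid = len(keys) // 2
    return ({k: relation[k] for k in keys[:mid]},
            {k: relation[k] for k in keys[mid:]})


def route_stream(stream, node_relations):
    """Send each stream tuple to the node whose relation part holds its key."""
    routed = [[] for _ in node_relations]
    for key, payload in stream:
        for i, part in enumerate(node_relations):
            if key in part:
                routed[i].append((key, payload))
                break  # tuples whose key is on no node are simply dropped
    return routed


def join_on_node(args):
    """Join one node's share of the stream with its part of the relation."""
    tuples, part = args
    return [(key, payload, part[key]) for key, payload in tuples]


def parallel_join(stream, relation):
    """Return one list of joined tuples per node, in node order."""
    parts = split_relation(relation)
    routed = route_stream(stream, parts)
    # Assumption: two worker threads stand in for the two separate nodes.
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(join_on_node, zip(routed, parts)))
```

Routing by key means each node only ever probes its own part of the relation, so neither node has to scan data that cannot match its share of the stream.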


Author(s):  
M. Asif Naeem ◽  
Noreen Jamil

Stream-based join algorithms are a promising technology for modern real-time data warehouses. A particular category of stream-based joins is the semi-stream join, where a single stream is joined with disk-based master data. The join operator typically works under limited main memory, and this memory is generally not large enough to hold the whole disk-based master data. Recently, a seminal join algorithm called MESHJOIN (Mesh Join) has been proposed in the literature to process semi-stream data. MESHJOIN is a candidate for a resource-aware system setup. However, MESHJOIN is not very selective. In particular, MESHJOIN does not consider the characteristics of stream data, and its performance is suboptimal for skewed stream data. This chapter presents a novel Cached-based Semi-Stream Join (CSSJ) using a cache module. The algorithm is more appropriate for skewed distributions, and we present results for Zipfian distributions of the type that appear in many applications. We conduct a rigorous experimental study to test our algorithm. Our experiments show that CSSJ outperforms MESHJOIN significantly. We also present the cost model for CSSJ and validate it with experiments.


2022 ◽  
Author(s):  
M. Asif Naeem ◽  
Wasiullah Waqar ◽  
Farhaan Mirza ◽  
Ali Tahir

Semi-stream join is an emerging research problem in the domain of near-real-time data warehousing. A semi-stream join is a join between a fast stream (S) and a slow disk-based relation (R). In the modern era of technology, huge amounts of data are generated daily and need to be analyzed immediately to support successful business decisions. With this in mind, a well-known algorithm called CACHEJOIN (Cache Join) was proposed. The limitation of the CACHEJOIN algorithm is that it does not deal efficiently with frequently changing trends in stream data. To overcome this limitation, in this paper we propose TinyLFU-CACHEJOIN, a modified version of the original CACHEJOIN algorithm that is designed to enhance its performance. TinyLFU-CACHEJOIN employs an intelligent strategy which keeps only those records of R in the cache that have a high hit rate in S. This mechanism allows TinyLFU-CACHEJOIN to deal with sudden and abrupt trend changes in S. We developed a cost model for our TinyLFU-CACHEJOIN algorithm and validated it empirically. We also compared the performance of our proposed TinyLFU-CACHEJOIN algorithm with the existing CACHEJOIN algorithm on a skewed synthetic dataset. The experiments showed that the TinyLFU-CACHEJOIN algorithm significantly outperforms the CACHEJOIN algorithm.
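The admission idea behind a TinyLFU-style cache can be sketched in a few lines of Python: a record of R enters a full cache only if its key has recently appeared in S more often than the key it would replace, and the counts are periodically halved so that old trends fade. This is a hedged illustration, not the authors' TinyLFU-CACHEJOIN. The class name `FrequencyAdmissionCache` and the parameters `capacity` and `sample_size` are hypothetical, and an exact `Counter` is used where TinyLFU itself relies on an approximate frequency sketch.

```python
from collections import Counter


class FrequencyAdmissionCache:
    """Hypothetical cache that admits a key only if it is hotter than the victim."""

    def __init__(self, capacity, sample_size=8):
        self.capacity = capacity
        self.sample_size = sample_size  # accesses between two aging steps
        self.freq = Counter()           # exact counts stand in for a sketch
        self.seen = 0
        self.cache = {}

    def _record(self, key):
        """Count one access and age all counts after `sample_size` accesses."""
        self.freq[key] += 1
        self.seen += 1
        if self.seen >= self.sample_size:
            # Aging: halve every count so that old trends fade away.
            for k in list(self.freq):
                self.freq[k] //= 2
                if self.freq[k] == 0:
                    del self.freq[k]
            self.seen = 0

    def lookup(self, key, load_from_relation):
        """Return (value, hit) for a stream key; may admit the key on a miss."""
        self._record(key)
        if key in self.cache:
            return self.cache[key], True
        value = load_from_relation(key)  # stands in for a disk read of R
        if len(self.cache) < self.capacity:
            self.cache[key] = value
        else:
            # The victim is the cached key with the lowest recent frequency.
            victim = min(self.cache, key=lambda k: self.freq[k])
            # Admit the newcomer only if it is more frequent than the victim.
            if self.freq[key] > self.freq[victim]:
                del self.cache[victim]
                self.cache[key] = value
        return value, False
```

Because a newcomer must beat the victim's recent frequency, a single occurrence of a rare key cannot push a hot record out of the cache, while a key that becomes popular in S eventually displaces a fading one.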


2020 ◽  
Vol 31 (1) ◽  
pp. 20-37 ◽  
Author(s):  
M. Asif Naeem ◽  
Erum Mehmood ◽  
M. G. Abbas Malik ◽  
Noreen Jamil

Streaming data join is a critical process in the field of near-real-time data warehousing. For this purpose, an adaptive semi-stream join algorithm called CACHEJOIN (Cache Join), focusing on non-uniform stream data, is provided in the literature. However, this algorithm cannot exploit the memory and CPU resources optimally, and consequently its service rate remains suboptimal due to the sequential execution of its two phases, the stream-probing (SP) phase and the disk-probing (DP) phase. Building on the advantages of CACHEJOIN, this article presents two modifications of it. The first, called P-CACHEJOIN (Parallel Cache Join), enables the parallel processing of the two phases in CACHEJOIN. This increases the number of joined stream records and therefore improves throughput considerably. The second, called OP-CACHEJOIN (Optimized Parallel Cache Join), implements parallel loading of stored data into memory while the DP phase is executing. This research empirically analyses the performance of both approaches against the existing CACHEJOIN using a synthetic skewed dataset.


2011 ◽  
Vol 121-126 ◽  
pp. 3195-3199
Author(s):  
Li Feng Yang ◽  
Jun Yuan ◽  
Wei Na Liu ◽  
Xiu Ming Nie ◽  
Xue Liang Pei

Kingview is used to acquire and display the performance parameters of a centrifugal pump in real time. The collected experimental data are stored in an Access database, and the database-reading and drawing functions of VB are used to process the data and plot the relationship curves of the performance parameters.

