Optimization and Extension of Stream-Relation Joins

2019 ◽  
Vol 18 (04) ◽  
pp. 1289-1315
Author(s):  
M. Asif Naeem

Online stream processing is an emerging research area in computer science. Semi-stream processing is a particular type of stream processing in which a stream of data is processed against a disk-based relation. A semi-stream join operator is required to implement this operation. Many semi-stream joins use a queue of stream tuples to amortize the access cost for the disk-based relation, and use an index to allow directed access to the relation, avoiding the loading of unnecessary partitions of [Formula: see text]. In such a situation, the question arises which [Formula: see text] partitions should be accessed, as any stream tuple from the queue could serve as a lookup element for accessing the relation index. Existing algorithms use simple, safe, and correct strategies, but these are not optimal in the sense of maximizing the join service rate. This paper makes two contributions. The first is an optimization: we analyze strategies for selecting an appropriate lookup element, particularly for skewed stream data, and show that a good selection strategy can significantly improve the service rate of existing join algorithms. The second is an extension: we develop a multi-stage join for semi-stream join algorithms. A multi-stage join is important when stream data needs to be joined with two or more tables in the relation, e.g., when a stream of sales data needs information added from both product and customer tables. To the best of our knowledge, none of the existing algorithms implements this feature. For the service rate evaluation we use two well-performing existing algorithms, CACHEJOIN and HYBRIDJOIN. We evaluate the service rate using real, TPC-H, and synthetic datasets with a known skewed distribution. We also present the cost model for our multi-stage join.
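The lookup-element question raised in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names (`pick_oldest`, `pick_most_frequent`, `join_step`), the in-memory `index` that maps each join key to its partition, and the sample data are all assumptions made for illustration. It compares probing with the oldest queued tuple against probing with the most frequent key in the queue.

```python
from collections import Counter, deque

def pick_oldest(queue):
    """Baseline strategy: use the join key of the oldest queued tuple."""
    return queue[0][0]

def pick_most_frequent(queue):
    """Alternative strategy: use the join key that occurs most often in the queue."""
    counts = Counter(key for key, _ in queue)
    return counts.most_common(1)[0][0]

def join_step(queue, index, pick):
    """One probe: load a single partition and join every queued tuple that matches it."""
    lookup_key = pick(queue)
    partition = index[lookup_key]  # hypothetical index: join key -> partition (dict of rows)
    output, remaining = [], deque()
    for key, payload in queue:
        if key in partition:
            output.append((key, payload, partition[key]))
        else:
            remaining.append((key, payload))
    return output, remaining

# Hypothetical relation split into three partitions of two rows each.
partitions = [{1: "a", 2: "b"}, {3: "c", 4: "d"}, {5: "e", 6: "f"}]
index = {key: part for part in partitions for key in part}

# Hypothetical queue of (join key, payload) stream tuples, oldest first.
queue = deque([(1, "s1"), (3, "s2"), (4, "s3"), (3, "s4"), (5, "s5")])

out_oldest, rest_oldest = join_step(queue, index, pick_oldest)
out_frequent, rest_frequent = join_step(queue, index, pick_most_frequent)
print(len(out_oldest), len(out_frequent))
```

On this sample queue the frequency-based choice joins three tuples with one partition load, while the oldest-tuple choice joins one. As written, the frequency heuristic does not guarantee that an old tuple with a rare key is ever served, so a real operator would need an additional rule to bound waiting time.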

2016 ◽  
Vol 12 (3) ◽  
pp. 14-31 ◽  
Author(s):  
M. Asif Naeem ◽  
Imran Sarwar Bajwa ◽  
Noreen Jamil

Stream-based join algorithms have gained a prominent role in the field of real-time data warehouses. One particular type of stream-based join is the semi-stream join, where a single stream is joined with a disk-based relation. Normally this disk-based relation is too large to fit into the memory available to the join operator. Therefore, the relation is loaded into memory in partitions. A well-known join algorithm called MESHJOIN (Mesh Join) has been presented in the literature to process semi-stream data. However, the algorithm has some limitations. In particular, MESHJOIN does not consider the characteristics of stream data and therefore does not perform well for skewed stream data. This article introduces the concept of caching and, based on that, presents a novel algorithm called Cached-based Stream-Relation Join (CSRJ). The algorithm exploits skewed distributions more appropriately, and the authors present results for Zipfian distributions of the type that appear in many applications. They test their algorithm using synthetic, TPC-H, and real datasets. Their experiments show that CSRJ performs significantly better than MESHJOIN. They also derive the cost model for their algorithm and use it to tune the algorithm.
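Why caching helps under a Zipfian distribution can be shown with a short calculation. The sketch below is not taken from the article: the function name `zipf_hit_fraction`, the key count, the cache size, and the exponent are assumptions. It computes the fraction of stream tuples whose join key falls among the most frequent keys, which is the share a cache holding those keys could serve without touching the disk.

```python
def zipf_hit_fraction(num_keys, cache_size, exponent=1.0):
    """Fraction of stream tuples whose key is among the `cache_size` most frequent keys,
    assuming key frequency is proportional to 1 / rank ** exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_keys + 1)]
    return sum(weights[:cache_size]) / sum(weights)

# Hypothetical setting: 1000 distinct keys, cache holds the top 100.
skewed = zipf_hit_fraction(1000, 100, exponent=1.0)
uniform = zipf_hit_fraction(1000, 100, exponent=0.0)
print(round(skewed, 2), round(uniform, 2))
```

Under these assumed numbers, caching 10% of the keys covers about 0.69 of the stream when the exponent is 1, against 0.10 for a uniform stream.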


2021 ◽  
Vol 11 (12) ◽  
pp. 5523
Author(s):  
Qian Ye ◽  
Minyan Lu

The main purpose of our provenance research for DSP (distributed stream processing) systems is to analyze abnormal results. Provenance for these systems is nontrivial because of the ephemerality of stream data and the instant data processing mode of modern DSP systems. Challenges include, but are not limited to, avoiding excessive runtime overhead, reducing provenance-related data storage, and providing provenance in an easy-to-use fashion. Without prior knowledge about which kinds of data may finally lead to abnormal results, we have to track all transformations in detail, which can place a heavy burden on the system. This paper proposes s2p (Stream Process Provenance), which mainly consists of online provenance and offline provenance, to provide fine- and coarse-grained provenance at different precision. We base our design of s2p on the fact that, for a mature online DSP system, abnormal results are rare, and results that require a detailed analysis are even rarer. We also consider state transition in our provenance explanation. We implement s2p on Apache Flink, named s2p-flink, and conduct three experiments to evaluate its scalability, efficiency, and overhead in terms of end-to-end cost, throughput, and space overhead. Our evaluation shows that s2p-flink incurs a 13% to 32% cost overhead, an 11% to 24% decline in throughput, and little additional space cost in the online provenance phase. The experiments also demonstrate that s2p-flink scales well. A case study is presented to demonstrate the feasibility of the whole s2p solution.


2021 ◽  
Author(s):  
Hamed Hasibi ◽  
Saeed Sedighian Kashi

Fog computing brings cloud capabilities closer to Internet of Things (IoT) devices. IoT devices generate a tremendous amount of stream data towards the cloud via hierarchical fog nodes. To process data streams, many Stream Processing Engines (SPEs) have been developed. Without the fog layer, stream query processing executes in the cloud, which forwards much traffic toward the cloud. When a hierarchical fog layer is available, a complex query can be divided into simple queries that run on fog nodes by using distributed stream processing. In this paper, we propose an approach to assign stream queries to fog nodes using container technology. We name this approach Stream Queries Placement in Fog (SQPF). Our goal is to minimize end-to-end delay to achieve a better quality of service. First, in the emulation step, we create Docker container instances from SPEs and evaluate their processing delay and throughput under different resource configurations and queries with varying input rates. Then, in the placement step, we assign queries among fog nodes by using a genetic algorithm. The practical approach used in SQPF achieves a near-best assignment based on the lowest application deadline in real scenarios, and the evaluation results support this claim.


2018 ◽  
Vol 11 (4) ◽  
pp. 90-109
Author(s):  
Mirosław Wylon ◽  
Agnieszka Kempa ◽  
Alicja Słowy ◽  
Justyna Chodkowska-Miszczuk

Subject and purpose of work: Urban transport is a key element of the functioning of urban agglomerations around the world. As it is of strategic importance, the needs of its users have to be diagnosed. Because students are the most numerous social group using public transport, particular attention should be paid to students as the real creators of the needs of urban transport. The paper aims to diagnose the challenges in urban transport shaped by the process of studentification, based on the case study of Toruń. Materials and methods: A multi-stage research approach was adopted, including a survey among students. The choice of the research area was determined by the fact that Toruń is one of the largest academic centres in Poland. Results: Toruń is experiencing the effects of the studentification process in different dimensions, including the spatial and transport facets. Conclusions: The majority of students use public transport daily or several times a week. The most preferred means of transport is the tram, owing to its relative speed and punctuality.


Author(s):  
M. Asif Naeem ◽  
Gillian Dobbie ◽  
Gerald Weber

In order to make timely and effective decisions, businesses need the latest information from big data warehouse repositories. To keep these repositories up to date, real-time data integration is required. An important phase in real-time data integration is data transformation where a stream of updates, which is huge in volume and infinite, is joined with large disk-based master data. Stream processing is an important concept in Big Data, since large volumes of data are often best processed immediately. A well-known algorithm called Mesh Join (MESHJOIN) was proposed to process stream data with disk-based master data, which uses limited memory. MESHJOIN is a candidate for a resource-aware system setup. The problem that the authors consider in this chapter is that MESHJOIN is not very selective. In particular, the performance of the algorithm is always inversely proportional to the size of the master data table. As a consequence, the resource consumption is in some scenarios suboptimal. They present an algorithm called Cache Join (CACHEJOIN), which performs asymptotically at least as well as MESHJOIN but performs better in realistic scenarios, particularly if parts of the master data are used with different frequencies. In order to quantify the performance differences, the authors compare both algorithms with a synthetic dataset of a known skewed distribution as well as TPC-H and real-life datasets.


2018 ◽  
Vol 2018 ◽  
pp. 1-7 ◽  
Author(s):  
Emilio Suyama ◽  
Roberto C. Quinino ◽  
Frederico R. B. Cruz

Estimators for the parameters of the Markovian multiserver queues are presented, from samples that are the number of clients in the system at arbitrary points and their sojourn times. As estimation in queues is a recognizably difficult inferential problem, this study focuses on the estimators for the arrival rate, the service rate, and the ratio of these two rates, which is known as the traffic intensity. Simulations are performed to verify the quality of the estimations for sample sizes up to 400. This research also relates notable new insights, for example, that the maximum likelihood estimator for the traffic intensity is equivalent to its moment estimator. Some limitations of the results are presented along with a detailed numerical example and topics for future developments in this research area.


2019 ◽  
Vol 23 (2) ◽  
pp. 555-574 ◽  
Author(s):  
Xiaohui Wei ◽  
Yuan Zhuang ◽  
Hongliang Li ◽  
Zhiliang Liu

2019 ◽  
Vol 30 (3) ◽  
pp. 657-675 ◽  
Author(s):  
Anand Jaiswal ◽  
Cherian Samuel ◽  
Chirag Chandan Mishra

Purpose: The purpose of this paper is to provide a traffic route selection strategy based on minimum carbon dioxide (CO2) emission by vehicles over different route choices. Design/methodology/approach: The study used queuing theory, with a Markovian M/M/1 model at the road junctions, to assess the total time spent at each junction for a route with junctions in tandem. With parameters of distance, mean service rate at the junction, the number of junctions, and fuel consumption rate, which is a function of variable average speed, the CO2 emission is estimated at each junction in tandem and collectively over each route. Findings: The outcome of the study is a mathematical formulation, using queuing theory, to estimate CO2 emissions over different route choices. The research estimated the total time spent and the subsequent CO2 emission for mean arrival rates of vehicles at junctions in tandem. The model is validated with a pilot study, and the result shows the best vehicular route choice with minimum CO2 emissions. Research limitations/implications: The proposed study is limited to the M/M/1 model at each junction, with no defection of vehicles. The study is also limited to a constant mean arrival rate at each junction. Practical implications: The work can be used to define strategies for routing vehicles over different route choices to minimise vehicular CO2 emissions. Originality/value: The proposed work gives a solution for minimising carbon emission over routes with unsignalised junctions in a tandem network.
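The kind of calculation this abstract describes can be sketched as follows. It is not the paper's formulation: it uses the standard M/M/1 mean time in system, 1 / (service rate - arrival rate), at each junction, and it assumes a constant fuel consumption rate and a constant CO2 factor per litre, whereas the paper treats fuel consumption as a function of average speed. The function names and all numeric inputs are hypothetical.

```python
def mm1_time_in_system(arrival_rate, service_rate):
    """Mean time a vehicle spends at one M/M/1 junction (waiting plus service)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)

def route_co2(distance_km, speed_kmh, arrival_rate, service_rates,
              fuel_l_per_h, co2_kg_per_l):
    """Total hours on a route and the CO2 emitted, under the constant-rate assumptions above."""
    travel_h = distance_km / speed_kmh
    # Junctions in tandem: each one sees the same mean arrival rate.
    junction_h = sum(mm1_time_in_system(arrival_rate, mu) for mu in service_rates)
    total_h = travel_h + junction_h
    return total_h, total_h * fuel_l_per_h * co2_kg_per_l

# Hypothetical route: 10 km at 40 km/h, two junctions, rates in vehicles per hour.
total_h, co2_kg = route_co2(
    distance_km=10, speed_kmh=40, arrival_rate=300, service_rates=[360, 400],
    fuel_l_per_h=2.0, co2_kg_per_l=2.3,
)
print(round(total_h, 4), round(co2_kg, 4))
```

Running the same function for each candidate route and taking the minimum of the second return value gives the route choice the abstract refers to, within the limits of these simplifications.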


10.29007/8lbk ◽  
2019 ◽  
Author(s):  
Kasumi Kato ◽  
Atsuko Takefusa ◽  
Hidemoto Nakada ◽  
Masato Oguchi

The spread of various sensors and the development of cloud computing technologies enable the accumulation and use of large numbers of live logs in ordinary homes. For a service that utilizes sensor data, it is difficult to install servers and storage in ordinary homes and to analyze the collected sensor data there. Such data are typically transmitted from sensors to a cloud and analyzed in the cloud. However, services that involve moving image analysis must transfer large amounts of data continuously and require high computing power for analysis. Hence, it is highly difficult to process them in real time in the cloud using a conventional stream data processing framework. In this research, we propose a construction scheme for a highly efficient distributed stream processing infrastructure that enables scalable processing of moving image recognition tasks according to the amount of data transmitted from sensors. We implement a prototype system of the proposed distributed stream processing infrastructure using Ray and Apache Kafka, which is a distributed messaging system, and we evaluate its performance. The experimental results demonstrate that the proposed distributed stream processing infrastructure is highly scalable.

