Big Data Management in the Context of Real-Time Data Warehousing

Author(s):  
M. Asif Naeem ◽  
Gillian Dobbie ◽  
Gerald Weber

In order to make timely and effective decisions, businesses need the latest information from big data warehouse repositories. To keep these repositories up to date, real-time data integration is required. An important phase in real-time data integration is data transformation where a stream of updates, which is huge in volume and infinite, is joined with large disk-based master data. Stream processing is an important concept in Big Data, since large volumes of data are often best processed immediately. A well-known algorithm called Mesh Join (MESHJOIN) was proposed to process stream data with disk-based master data, which uses limited memory. MESHJOIN is a candidate for a resource-aware system setup. The problem that the authors consider in this chapter is that MESHJOIN is not very selective. In particular, the performance of the algorithm is always inversely proportional to the size of the master data table. As a consequence, the resource consumption is in some scenarios suboptimal. They present an algorithm called Cache Join (CACHEJOIN), which performs asymptotically at least as well as MESHJOIN but performs better in realistic scenarios, particularly if parts of the master data are used with different frequencies. In order to quantify the performance differences, the authors compare both algorithms with a synthetic dataset of a known skewed distribution as well as TPC-H and real-life datasets.
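The general idea of putting a cache in front of a MESHJOIN-style disk probe can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' CACHEJOIN implementation: the names `cache_join`, `hot_cache`, and `master_pages` are hypothetical, the "disk" is simulated with an in-memory list of pages, and the disk phase is simplified to one full pass over the pages per batch of cache misses.

```python
from collections import defaultdict

def cache_join(stream, master_pages, hot_cache, batch_size=4):
    """Join stream tuples (key, payload) with master data.

    hot_cache:    dict key -> master row, the frequently used master data
                  kept in memory (assumed to be chosen beforehand).
    master_pages: list of dicts (key -> master row) standing in for disk pages.
    Returns a list of (key, payload, master_row) in output order.
    """
    output = []
    pending = []  # stream tuples that missed the cache

    def probe_disk(batch):
        # Index the waiting tuples by key, then scan every page once,
        # so the cost of one pass is shared by the whole batch.
        index = defaultdict(list)
        for key, payload in batch:
            index[key].append(payload)
        for page in master_pages:
            for key, row in page.items():
                for payload in index.get(key, ()):
                    output.append((key, payload, row))

    for key, payload in stream:
        row = hot_cache.get(key)
        if row is not None:
            output.append((key, payload, row))  # cache hit: no disk access
        else:
            pending.append((key, payload))
            if len(pending) >= batch_size:
                probe_disk(pending)
                pending = []
    if pending:
        probe_disk(pending)  # flush the last partial batch
    return output
```

In this sketch, tuples whose keys are in the cache never wait for a pass over the master data, which is where the benefit for unevenly used master data would come from.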

Author(s):  
M. Asif Naeem ◽  
Noreen Jamil

Stream-based join algorithms are a promising technology for modern real-time data warehouses. A particular category of stream-based joins is the semi-stream join, where a single stream is joined with disk-based master data. The join operator typically works under limited main memory, which is generally not large enough to hold the whole disk-based master data. Recently, a seminal join algorithm called MESHJOIN (Mesh Join) has been proposed in the literature to process semi-stream data. MESHJOIN is a candidate for a resource-aware system setup. However, MESHJOIN is not very selective. In particular, MESHJOIN does not consider the characteristics of stream data, and its performance is suboptimal for skewed stream data. This chapter presents a novel Cached-based Semi-Stream Join (CSSJ) using a cache module. The algorithm is more appropriate for skewed distributions, and we present results for Zipfian distributions of the type that appear in many applications. We conduct a rigorous experimental study to test our algorithm. Our experiments show that CSSJ outperforms MESHJOIN significantly. We also present the cost model for CSSJ and validate it with experiments.
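Why a cache helps under Zipfian skew can be illustrated with a short calculation. The sketch below is not taken from the chapter's cost model: it assumes that the key of rank r occurs with probability proportional to 1 / r**exponent and that the cache holds exactly the most frequent keys. The function name `zipf_cache_hit_rate` and the sizes in the usage loop are hypothetical.

```python
def zipf_cache_hit_rate(n_keys, cache_size, exponent=1.0):
    """Expected fraction of stream tuples whose key is among the
    cache_size most frequent keys, when the key of rank r has
    probability proportional to 1 / r**exponent."""
    if n_keys < 1:
        raise ValueError("n_keys must be at least 1")
    if not 0 <= cache_size <= n_keys:
        raise ValueError("cache_size must be between 0 and n_keys")
    weights = [1.0 / rank ** exponent for rank in range(1, n_keys + 1)]
    return sum(weights[:cache_size]) / sum(weights)

# Usage: hit rate for a few cache sizes over 100,000 distinct keys
for size in (10, 100, 1000):
    print(size, round(zipf_cache_hit_rate(100_000, size), 3))
```

With `exponent=0.0` the distribution is uniform and the hit rate is simply `cache_size / n_keys`, which gives a baseline for comparison with the skewed case.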


Electronics ◽  
2020 ◽  
Vol 9 (8) ◽  
pp. 1299
Author(s):  
M. Asif Naeem ◽  
Habib Khan ◽  
Saad Aslam ◽  
Noreen Jamil

Near real-time data warehousing is an important area of research, as business organisations want to analyse their business sales with minimal latency. Therefore, sales data generated by data sources need to be reflected immediately in the data warehouse. This requires near-real-time transformation of the stream of sales data with a disk-based relation called master data in the staging area. For this purpose, a stream-relation join is required. The main problem in stream-relation joins is the different nature of the inputs: stream data is fast and bursty, whereas the disk-based relation is slow due to high disk I/O cost. To resolve this problem, a well-known algorithm, CACHEJOIN (cache join), was published in the literature. The algorithm has two phases, the disk-probing phase and the stream-probing phase. These two phases execute sequentially, which means that stream tuples wait unnecessarily. This prevents the algorithm from exploiting CPU resources optimally. In this paper, we address this issue by presenting a robust algorithm called PCSRJ (parallelised cache-based stream relation join). The new algorithm enables the execution of both disk-probing and stream-probing phases of CACHEJOIN in parallel. The algorithm distributes the disk-based relation on two separate nodes and enables parallel execution of CACHEJOIN on each node. The algorithm also implements a strategy of splitting the stream data on each node depending on the relevant part of the relation. We developed a cost model for PCSRJ and validated it empirically. We compared the service rates of both algorithms using a synthetic dataset. Our experiments showed that PCSRJ significantly outperforms CACHEJOIN.
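The partition-and-route idea described above can be sketched as follows. This is an illustration under assumptions, not the PCSRJ algorithm itself: worker threads stand in for the two nodes, hash partitioning is an assumed splitting strategy, the per-node join is reduced to a dictionary lookup instead of a full CACHEJOIN, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_relation(relation, n_nodes=2):
    """Split the master relation (dict key -> row) into n_nodes parts by key hash."""
    parts = [dict() for _ in range(n_nodes)]
    for key, row in relation.items():
        parts[hash(key) % n_nodes][key] = row
    return parts

def route_stream(stream, n_nodes=2):
    """Send each stream tuple to the node that holds the matching part of the relation."""
    routed = [[] for _ in range(n_nodes)]
    for key, payload in stream:
        routed[hash(key) % n_nodes].append((key, payload))
    return routed

def join_on_node(part, tuples):
    """Stand-in for the per-node join: a plain lookup in that node's partition."""
    return [(key, payload, part[key]) for key, payload in tuples if key in part]

def parallel_join(relation, stream, n_nodes=2):
    """Run the per-node joins concurrently and concatenate their results."""
    parts = partition_relation(relation, n_nodes)
    routed = route_stream(stream, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        results = pool.map(join_on_node, parts, routed)
        return [match for node_result in results for match in node_result]
```

Because the same hash function decides both where a master row lives and where a stream tuple is sent, each node only ever needs its own part of the relation. Results come back grouped by node, so the original stream order is not preserved in this sketch.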


Author(s):  
Rizwan Patan ◽  
Rajasekhara Babu M ◽  
Suresh Kallam

A Big Data Stream Computing (BDSC) platform handles real-time data from various applications such as risk management, marketing management, and business intelligence. Nowadays, Internet of Things (IoT) deployment is increasing massively in all areas. These IoT devices generate real-time data for analysis. Existing BDSC platforms handle real-time data streams from IoT devices inefficiently because such streams are unstructured and have inconstant velocity, which makes them challenging to process. This work proposes a framework that handles real-time data streams through device control techniques to improve performance. The framework includes three layers. The first layer deals with Big Data platforms that handle real-time data streams based on area of importance. The second layer is the performance layer, which deals with performance issues such as low response time and energy efficiency. The third layer applies the developed method to an existing BDSC platform. The experimental results show a performance improvement of 20%-30% for real-time data streams from an IoT application.


Author(s):  
Amitava Choudhury ◽  
Kalpana Rangra

The variety and volume of data in human society are growing at an amazing speed, driven by emerging services such as cloud computing, the internet of things, and location-based services. The era of big data has arrived. As data has become a fundamental resource, how to better manage and utilize big data has attracted much attention. Especially with the development of the internet of things, how to process large amounts of real-time data has become a great challenge in research and applications. Recently, cloud computing technology has attracted much attention for its high performance, but how to use cloud computing technology for large-scale real-time data processing has not been studied. In this chapter, various big data processing techniques are discussed.


2021 ◽  
Vol 4 ◽  
Author(s):  
Logan Froese ◽  
Joshua Dian ◽  
Carleen Batson ◽  
Alwyn Gomez ◽  
Amanjyot Singh Sainbhi ◽  
...  

Introduction: As real-time data processing is integrated with medical care for traumatic brain injury (TBI) patients, there is a requirement for devices to have digital output. However, there are still many devices that fail to have the required hardware to export real-time data into an acceptable digital format or in a continuously updating manner. This is particularly the case for many intravenous pumps and older technological systems. Such accurate and digital real-time data integration within TBI care and other fields is critical as we move towards digitizing healthcare information and integrating clinical data streams to improve bedside care. We propose to address this gap in technology by building a system that employs Optical Character Recognition through computer vision, using real-time images from a pump monitor to extract the desired real-time information. Methods: Using freely available software and readily available technology, we built a script that extracts real-time images from a medication pump and then processes them using Optical Character Recognition to create digital text from the image. This text was then transferred to the ICM+ real-time monitoring software in parallel with other retrieved physiological data. Results: The prototype that was built works effectively for our device, with source code openly available to interested end-users. However, future work is required for a more universal application of such a system. Conclusion: Advances here can improve medical information collection in the clinical environment, eliminating human error with bedside charting, and aid in data integration for biomedical research where many complex data sets can be seamlessly integrated digitally. Our design demonstrates a simple adaptation of current technology to help with this integration.
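The step between OCR output and a monitoring system can be sketched as below. This is a hypothetical illustration, not the authors' script: the image capture and OCR calls are left out because the abstract does not name the software used, and the display format (`RATE 12.5 mL/h`), the function name `parse_pump_reading`, and the output field names are all assumptions.

```python
import re
from datetime import datetime, timezone

# Assumed display format, e.g. "RATE 12.5 mL/h" or "Rate: 3,0 ml/h".
_RATE_PATTERN = re.compile(
    r"RATE\s*[:=]?\s*(\d+(?:[.,]\d+)?)\s*mL/h", re.IGNORECASE
)

def parse_pump_reading(ocr_text, timestamp=None):
    """Turn one frame's OCR text into a timestamped numeric record.

    Returns None when no rate can be found, so a caller can skip
    frames in which the recognition failed.
    """
    match = _RATE_PATTERN.search(ocr_text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", "."))
    when = timestamp or datetime.now(timezone.utc)
    return {"time": when.isoformat(), "rate_ml_per_h": value}
```

Returning `None` for unreadable frames, instead of raising, keeps a continuously running capture loop alive when a single image is blurred or partly obscured.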


Sensors ◽  
2018 ◽  
Vol 18 (9) ◽  
pp. 2994 ◽  
Author(s):  
Bhagya Silva ◽  
Murad Khan ◽  
Changsu Jung ◽  
Jihun Seo ◽  
Diyan Muhammad ◽  
...  

The Internet of Things (IoT), inspired by the tremendous growth of connected heterogeneous devices, has pioneered the notion of the smart city. Various components integrated within the smart city architecture, such as smart transportation, smart community, smart healthcare, and smart grid, aim to enrich the quality of life (QoL) of urban citizens. However, real-time processing requirements and exponential data growth hinder smart city realization. Therefore, herein we propose a Big Data analytics (BDA)-embedded experimental architecture for smart cities. Two major aspects are served by the BDA-embedded smart city. Firstly, it facilitates exploitation of urban Big Data (UBD) in planning, designing, and maintaining smart cities. Secondly, it employs BDA to manage and process voluminous UBD to enhance the quality of urban services. Three tiers of the proposed architecture are responsible for data aggregation, real-time data management, and service provisioning. Moreover, offline and online data processing tasks are further expedited by integrating data normalizing and data filtering techniques into the proposed work. By analyzing authenticated datasets, we obtained the threshold values required for urban planning and city operation management. Performance metrics in terms of online and offline data processing for the proposed dual-node Hadoop cluster are obtained using the aforementioned authentic datasets. Throughput and processing time analysis performed with regard to existing works indicates the performance superiority of the proposed work. Hence, we can claim the applicability and reliability of implementing the proposed BDA-embedded smart city architecture in the real world.
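Normalizing and threshold-based filtering, as mentioned above, can be illustrated with a short sketch. The abstract does not state which techniques were used, so min-max scaling and a simple greater-than filter are assumptions chosen as common examples, and the function names are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly to the range [0, 1].

    If all values are equal there is no spread to scale,
    so every value maps to 0.0.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def filter_above(readings, threshold):
    """Keep only the readings that exceed the threshold."""
    return [r for r in readings if r > threshold]

# Usage: normalize raw sensor readings, then keep the high ones
normalized = min_max_normalize([10, 20, 30])
flagged = filter_above(normalized, 0.4)
```

Filtering after normalization lets a single threshold be applied to sources that report in different units or ranges.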

