Big Data Management in the Context of Real-Time Data Warehousing

In order to make timely and effective decisions, businesses need the latest information from big data warehouse repositories. To keep these repositories up to date, real-time data integration is required. An important phase in real-time data integration is data transformation, where a stream of updates, which is huge in volume and infinite, is joined with large disk-based master data. Stream processing is an important concept in Big Data, since large volumes of data are often best processed immediately. A well-known algorithm called Mesh Join (MESHJOIN) was proposed to join stream data with disk-based master data using limited memory. MESHJOIN is a candidate for a resource-aware system setup. The problem that the authors consider in this chapter is that MESHJOIN is not very selective. In particular, the performance of the algorithm is always inversely proportional to the size of the master data table. As a consequence, resource consumption is suboptimal in some scenarios. They present an algorithm called Cache Join (CACHEJOIN), which performs asymptotically at least as well as MESHJOIN but performs better in realistic scenarios, particularly if parts of the master data are used with different frequencies. In order to quantify the performance differences, the authors compare both algorithms on a synthetic dataset with a known skewed distribution as well as on TPC-H and real-life datasets.
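
The cache-first idea in this abstract can be illustrated with a small sketch. This is not the authors' CACHEJOIN implementation: the function name `cache_first_join`, the `cache_keys` argument, and the fixed `partition_size` are hypothetical, the master data is an in-memory dictionary standing in for a disk-based table, and the two phases run once over a finite batch rather than continuously over an unbounded stream.

```python
from collections import deque


def cache_first_join(stream, master, cache_keys, partition_size=2):
    """Join (key, payload) stream tuples with master data, probing a cache first.

    `master` is a dict standing in for the disk-based table (an assumption
    made so the sketch is self-contained).
    """
    # Frequently used master rows are held in memory.
    cache = {k: master[k] for k in cache_keys if k in master}
    pending = deque()  # stream tuples that missed the cache
    output = []

    # Cache phase: tuples whose key is cached are joined immediately.
    for key, payload in stream:
        if key in cache:
            output.append((key, payload, cache[key]))
        else:
            pending.append((key, payload))

    # Disk phase: read the master data one partition at a time and match
    # the waiting tuples against each partition.
    keys = sorted(master)
    for start in range(0, len(keys), partition_size):
        partition = {k: master[k] for k in keys[start:start + partition_size]}
        for key, payload in pending:
            if key in partition:
                output.append((key, payload, partition[key]))
    return output


master = {1: "a", 2: "b", 3: "c", 4: "d"}
stream = [(1, "x"), (3, "y"), (1, "z"), (5, "w")]
result = cache_first_join(stream, master, cache_keys=[1])
```

In this example the two tuples with key 1 are answered from the cache without touching the partitions, the tuple with key 3 is matched during the partition scan, and the tuple with key 5 has no master row and is dropped.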


Online Processing of End-User Data in Real-Time Data Warehousing

Improving Knowledge Discovery through the Integration of Data Mining Techniques - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-8513-0.ch002 ◽

2015 ◽

pp. 13-31

Author(s):

M. Asif Naeem ◽

Noreen Jamil

Keyword(s):

Real Time ◽

Cost Model ◽

Main Memory ◽

Stream Data ◽

Time Data ◽

Online Processing ◽

Master Data ◽

Real Time Data ◽

User Data ◽

Resource Aware

Stream-based join algorithms are a promising technology for modern real-time data warehouses. A particular category of stream-based joins is the semi-stream join, where a single stream is joined with disk-based master data. The join operator typically works under limited main memory, and this memory is generally not large enough to hold the whole disk-based master data. Recently, a seminal join algorithm called MESHJOIN (Mesh Join) has been proposed in the literature to process semi-stream data. MESHJOIN is a candidate for a resource-aware system setup. However, MESHJOIN is not very selective. In particular, MESHJOIN does not consider the characteristics of stream data, and its performance is suboptimal for skewed stream data. This chapter presents a novel Cache-based Semi-Stream Join (CSSJ) using a cache module. The algorithm is more appropriate for skewed distributions, and we present results for Zipfian distributions of the type that appear in many applications. We conduct a rigorous experimental study to test our algorithm. Our experiments show that CSSJ outperforms MESHJOIN significantly. We also present the cost model for our CSSJ and validate it with experiments.
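
To see why a cache module helps with skewed stream data, the short calculation below estimates what fraction of stream tuples would hit a cache that holds the most frequent master keys under a Zipfian distribution. It is a back-of-the-envelope illustration and not the chapter's cost model; the function name and the parameter values (1,000 keys, a cache of 10 keys, exponent 1) are assumptions chosen for the example.

```python
def zipf_cache_hit_fraction(num_keys, cache_size, exponent=1.0):
    """Fraction of stream tuples whose key is among the `cache_size` most
    frequent keys, when key rank r occurs with probability proportional
    to 1 / r**exponent (a Zipfian distribution)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    return sum(weights[:cache_size]) / sum(weights)


# Skewed stream: caching 1% of the keys serves roughly 39% of the tuples.
skewed = zipf_cache_hit_fraction(num_keys=1000, cache_size=10, exponent=1.0)

# Uniform stream (exponent 0): the same cache serves only 1% of the tuples.
uniform = zipf_cache_hit_fraction(num_keys=1000, cache_size=10, exponent=0.0)
```

Under these assumed parameters the benefit of the cache comes entirely from the skew: with a uniform key distribution the cache hit fraction equals the fraction of keys cached.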


Parallelisation of a Cache-Based Stream-Relation Join for a Near-Real-Time Data Warehouse

Electronics ◽

10.3390/electronics9081299 ◽

2020 ◽

Vol 9 (8) ◽

pp. 1299

Author(s):

M. Asif Naeem ◽

Habib Khan ◽

Saad Aslam ◽

Noreen Jamil

Keyword(s):

Real Time ◽

Data Warehouse ◽

Cost Model ◽

Parallel Execution ◽

Stream Data ◽

Time Data ◽

Sales Data ◽

Master Data ◽

Real Time Data ◽

Two Phases

Near-real-time data warehousing is an important area of research, as business organisations want to analyse their business sales with minimal latency. Therefore, sales data generated by data sources need to be reflected immediately in the data warehouse. This requires near-real-time transformation of the stream of sales data with a disk-based relation called master data in the staging area. For this purpose, a stream-relation join is required. The main problem in stream-relation joins is the different nature of the inputs: stream data is fast and bursty, whereas the disk-based relation is slow due to high disk I/O cost. To resolve this problem, a well-known algorithm, CACHEJOIN (cache join), was published in the literature. The algorithm has two phases, the disk-probing phase and the stream-probing phase. These two phases execute sequentially, which means stream tuples wait unnecessarily. This prevents the algorithm from exploiting CPU resources optimally. In this paper, we address this issue by presenting a robust algorithm called PCSRJ (parallelised cache-based stream relation join). The new algorithm enables the execution of both the disk-probing and stream-probing phases of CACHEJOIN in parallel. The algorithm distributes the disk-based relation over two separate nodes and enables parallel execution of CACHEJOIN on each node. The algorithm also implements a strategy of splitting the stream data between the nodes depending on the relevant part of the relation. We developed a cost model for PCSRJ and validated it empirically. We compared the service rates of both algorithms using a synthetic dataset. Our experiments showed that PCSRJ significantly outperforms CACHEJOIN.
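
The partitioning strategy in this abstract (the relation distributed over two nodes, with the stream split according to the relevant part of the relation) might look roughly like the sketch below. It is not the PCSRJ algorithm itself: the range split on a single `split_key`, the use of two threads to stand in for two nodes, and a plain dictionary lookup in place of CACHEJOIN on each node are all assumptions made to keep the example small.

```python
from concurrent.futures import ThreadPoolExecutor


def join_on_node(relation_part, stream_part):
    """Join the stream tuples routed to one node with that node's share of
    the relation (a dictionary lookup stands in for CACHEJOIN here)."""
    return [(k, v, relation_part[k]) for k, v in stream_part if k in relation_part]


def parallel_partitioned_join(relation, stream, split_key):
    """Split relation and stream at `split_key`, then join both halves in parallel."""
    low = {k: v for k, v in relation.items() if k < split_key}
    high = {k: v for k, v in relation.items() if k >= split_key}
    # Each stream tuple is routed to the node holding the matching key range.
    low_stream = [(k, v) for k, v in stream if k < split_key]
    high_stream = [(k, v) for k, v in stream if k >= split_key]

    # Two worker threads stand in for the two nodes.
    with ThreadPoolExecutor(max_workers=2) as pool:
        low_future = pool.submit(join_on_node, low, low_stream)
        high_future = pool.submit(join_on_node, high, high_stream)
        return low_future.result() + high_future.result()


relation = {1: "a", 2: "b", 3: "c", 4: "d"}
stream = [(4, "s1"), (1, "s2"), (2, "s3"), (9, "s4")]
joined = parallel_partitioned_join(relation, stream, split_key=3)
```

Because each tuple is routed to exactly one half, no tuple is joined twice and the halves never need to communicate. The output is grouped by node rather than in stream arrival order.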


Performance Improvement IoT Applications Through Multimedia Analytics Using Big Data Stream Computing Platforms

Exploring the Convergence of Big Data and the Internet of Things - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-2947-7.ch015 ◽

2018 ◽

pp. 200-221

Author(s):

Rizwan Patan ◽

Rajasekhara Babu M ◽

Suresh Kallam

Keyword(s):

Big Data ◽

Real Time ◽

Performance Improvement ◽

Data Stream ◽

Real Data ◽

Stream Computing ◽

Time Data ◽

Real Time Data ◽

Computing Platforms ◽

Time And Energy

A Big Data Stream Computing (BDSC) platform handles real-time data from various applications such as risk management, marketing management, and business intelligence. Nowadays, Internet of Things (IoT) deployment is increasing massively in all areas. IoT devices generate real-time data for analysis. Existing BDSC platforms handle real-time data streams from IoT devices inefficiently because these streams are unstructured and arrive at an inconstant velocity, so handling them is challenging. This work proposes a framework that handles real-time data streams through device control techniques to improve performance. The framework includes three layers. The first layer deals with Big Data platforms that handle real data streams based on area of importance. The second layer is the performance layer, which deals with performance issues such as low response time and energy efficiency. The third layer applies the developed method to an existing BDSC platform. The experimental results show a performance improvement of 20%-30% for real-time data streams from IoT applications.


Trends and Technologies in Big Data Processing

Advances in Computational Intelligence and Robotics - Innovations, Algorithms, and Applications in Cognitive Informatics and Natural Intelligence ◽

10.4018/978-1-7998-3038-2.ch002 ◽

2020 ◽

pp. 17-42

Author(s):

Amitava Choudhury ◽

Kalpana Rangra

Keyword(s):

Cloud Computing ◽

Big Data ◽

Internet Of Things ◽

Data Processing ◽

Real Time ◽

Computing Technology ◽

Time Data ◽

Big Data Processing ◽

Real Time Data ◽

Real Time Data Processing

The variety and volume of data in human society are growing at an amazing speed, driven by emerging services such as cloud computing, the internet of things, and location-based services. The era of big data has arrived. As data has become a fundamental resource, how to better manage and utilize big data has attracted much attention. Especially with the development of the internet of things, how to process a large amount of real-time data has become a great challenge in research and applications. Recently, cloud computing technology has attracted much attention for its high performance, but how to use cloud computing technology for large-scale real-time data processing has not been studied. In this chapter, various big data processing techniques are discussed.


A novel performance aware real-time data handling for big data platforms on Lambda architecture

International Journal of Computer Aided Engineering and Technology ◽

10.1504/ijcaet.2018.092840 ◽

2018 ◽

Vol 10 (4) ◽

pp. 418 ◽

Cited By ~ 1

Author(s):

Rizwan Patan ◽

M. Rajasekhara Babu

Keyword(s):

Big Data ◽

Real Time ◽

Data Handling ◽

Time Data ◽

Lambda Architecture ◽

Real Time Data


A Real Time Data Association Prototype System for Multi-Tenants in Big Data

Journal of Physics Conference Series ◽

10.1088/1742-6596/1176/2/022014 ◽

2019 ◽

Vol 1176 ◽

pp. 022014

Author(s):

Ge Fu ◽

Yu Ding ◽

Siyu Jia

Keyword(s):

Big Data ◽

Real Time ◽

Data Association ◽

Prototype System ◽

Time Data ◽

Real Time Data


Introduction to Real-Time Data Integration

Managing Data in Motion ◽

10.1016/b978-0-12-397167-8.00011-x ◽

2013 ◽

pp. 77-78

Author(s):

April Reeve

Keyword(s):

Data Integration ◽

Real Time ◽

Time Data ◽

Real Time Data


Continuous Improvement through Real-Time Data Integration into Reservoir Management Workflows

10.2118/128660-ms ◽

2010 ◽

Cited By ~ 4

Author(s):

Tor K. Kragas ◽

Oktay Metin Gokdemir

Keyword(s):

Data Integration ◽

Real Time ◽

Continuous Improvement ◽

Reservoir Management ◽

Time Data ◽

Real Time Data


Computer Vision for Continuous Bedside Pharmacological Data Extraction: A Novel Application of Artificial Intelligence for Clinical Data Recording and Biomedical Research

Frontiers in Big Data ◽

10.3389/fdata.2021.689358 ◽

2021 ◽

Vol 4 ◽

Author(s):

Logan Froese ◽

Joshua Dian ◽

Carleen Batson ◽

Alwyn Gomez ◽

Amanjyot Singh Sainbhi ◽

...

Keyword(s):

Computer Vision ◽

Data Integration ◽

Real Time ◽

Clinical Data ◽

Biomedical Research ◽

Character Recognition ◽

Optical Character Recognition ◽

Time Data ◽

Optical Character ◽

Real Time Data

Introduction: As real-time data processing is integrated with medical care for traumatic brain injury (TBI) patients, there is a requirement for devices to have digital output. However, there are still many devices that fail to have the required hardware to export real-time data into an acceptable digital format or in a continuously updating manner. This is particularly the case for many intravenous pumps and older technological systems. Such accurate and digital real-time data integration within TBI care and other fields is critical as we move towards digitizing healthcare information and integrating clinical data streams to improve bedside care. We propose to address this gap in technology by building a system that employs Optical Character Recognition through computer vision, using real-time images from a pump monitor to extract the desired real-time information. Methods: Using freely available software and readily available technology, we built a script that extracts real-time images from a medication pump and then processes them using Optical Character Recognition to create digital text from the image. This text was then transferred to ICM+ real-time monitoring software in parallel with other retrieved physiological data. Results: The prototype that was built works effectively for our device, with source code openly available to interested end-users. However, future work is required for a more universal application of such a system. Conclusion: Advances here can improve medical information collection in the clinical environment, eliminating human error with bedside charting, and aid in data integration for biomedical research where many complex data sets can be seamlessly integrated digitally. Our design demonstrates a simple adaptation of current technology to help with this integration.


Urban Planning and Smart City Decision Management Empowered by Real-Time Data Processing Using Big Data Analytics

Sensors ◽

10.3390/s18092994 ◽

2018 ◽

Vol 18 (9) ◽

pp. 2994 ◽

Cited By ~ 28

Author(s):

Bhagya Silva ◽

Murad Khan ◽

Changsu Jung ◽

Jihun Seo ◽

Diyan Muhammad ◽

...

Keyword(s):

Big Data ◽

Data Processing ◽

Real Time ◽

Smart City ◽

Data Analytics ◽

Smart Cities ◽

Big Data Analytics ◽

Time Data ◽

Real Time Data

The Internet of Things (IoT), inspired by the tremendous growth of connected heterogeneous devices, has pioneered the notion of the smart city. Various components, e.g., smart transportation, smart community, smart healthcare, and smart grid, which are integrated within the smart city architecture, aim to enrich the quality of life (QoL) of urban citizens. However, real-time processing requirements and exponential data growth hinder smart city realization. Therefore, herein we propose a Big Data analytics (BDA)-embedded experimental architecture for smart cities. Two major aspects are served by the BDA-embedded smart city. Firstly, it facilitates exploitation of urban Big Data (UBD) in planning, designing, and maintaining smart cities. Secondly, it employs BDA to manage and process voluminous UBD to enhance the quality of urban services. The three tiers of the proposed architecture are responsible for data aggregation, real-time data management, and service provisioning. Moreover, offline and online data processing tasks are further expedited by integrating data normalizing and data filtering techniques into the proposed work. By analyzing authenticated datasets, we obtained the threshold values required for urban planning and city operation management. Performance metrics in terms of online and offline data processing for the proposed dual-node Hadoop cluster are obtained using the aforementioned authentic datasets. Throughput and processing time analysis performed with regard to existing works guarantees the performance superiority of the proposed work. Hence, we can claim the applicability and reliability of implementing the proposed BDA-embedded smart city architecture in the real world.
