HYBRIDJOIN for Near-Real-Time Data Warehousing

2011 ◽  
Vol 7 (4) ◽  
pp. 21-42 ◽  
Author(s):  
M. Asif Naeem ◽  
Gillian Dobbie ◽  
Gerald Weber

An important component of near-real-time data warehouses is the near-real-time integration layer. One important element in near-real-time data integration is the join of a continuous input data stream with a disk-based relation. For high-throughput streams, stream-based algorithms such as Mesh Join (MESHJOIN) can be used. However, the performance of MESHJOIN is inversely proportional to the size of the disk-based relation. The Index Nested Loop Join (INLJ) can be set up to process stream input and can deal with interruptions in the update stream, but it has low throughput. This paper introduces a robust stream-based join algorithm called Hybrid Join (HYBRIDJOIN), which combines the two approaches. A theoretical result shows that HYBRIDJOIN is asymptotically as fast as the faster of the two algorithms. The authors present performance measurements of the implementation. In experiments using synthetic data based on a Zipfian distribution, HYBRIDJOIN performs significantly better for typical parameters of the Zipfian distribution and, in general, performs in accordance with the theoretical model, while the other two algorithms are unacceptably slow under various settings.
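The core trade-off is easiest to see in code. Below is a minimal, illustrative sketch of the hybrid idea in Python (not the authors' exact HYBRIDJOIN implementation): stream tuples are queued, each disk access is an index lookup driven by the oldest queued tuple, and the fetched partition is probed against the whole queue so a single disk read is amortized over many stream tuples. The `index_lookup` helper and the tuple layout are assumptions made for illustration.

```python
from collections import deque

def hybrid_join(stream, index_lookup, queue_limit=1000):
    """index_lookup(key) -> dict mapping join_key -> master-data row for the
    disk partition that contains `key`; a hypothetical helper standing in
    for an index access to the disk-based relation."""
    queue = deque()
    for tup in stream:                            # tup = (join_key, payload)
        queue.append(tup)
        if len(queue) < queue_limit:
            continue
        partition = index_lookup(queue[0][0])     # one disk read, driven by the oldest queued tuple
        pending = deque()
        for key, payload in queue:                # probe the whole queue against the loaded partition
            if key in partition:
                yield (key, payload, partition[key])
            else:
                pending.append((key, payload))    # unmatched tuples wait for a later partition
        queue = pending
```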


2021 ◽  
Author(s):  
Flavio de Assis Vilela ◽  
Ricardo Rodrigues Ciferri

ETL (Extract, Transform, and Load) is an essential process for data extraction in knowledge discovery in databases and in data warehousing environments. The ETL process gathers data available from operational sources, processes it, and stores it in an integrated data repository. The ETL process can also be performed in a real-time data warehousing environment, storing the data into a data warehouse. This paper presents a new method named Data Extraction Magnet (DEM) to perform the extraction phase of the ETL process in a real-time data warehousing environment, based on the concepts of non-intrusiveness, tagging, and parallelism. DEM has been validated in a dairy farming domain using synthetic data. The results showed a great performance gain over the traditional trigger technique and satisfaction of real-time requirements.
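The abstract does not detail DEM's internals; the following is a loose, hypothetical sketch of the general tag-and-parallelism idea (an assumption, not the authors' design): rows written by the operational system carry a tag column, and parallel extraction workers claim tagged batches without any triggers on the source table. The table, columns, and `ship()` callback are placeholders.

```python
# Hypothetical sketch of tag-based, non-intrusive parallel extraction: rows
# written by the operational system carry extracted = 0; worker threads claim
# tagged batches and ship them concurrently, with no triggers on the source table.
import sqlite3, threading

claim_lock = threading.Lock()   # serialize claiming; shipping still runs in parallel

def extract_worker(db_path, ship, batch=100):
    con = sqlite3.connect(db_path)
    while True:
        with claim_lock:        # claim a batch of tagged rows without double-shipping
            rows = con.execute(
                "SELECT id, payload FROM sales WHERE extracted = 0 LIMIT ?", (batch,)
            ).fetchall()
            if rows:
                ids = [r[0] for r in rows]
                qmarks = ",".join("?" * len(ids))
                con.execute(f"UPDATE sales SET extracted = 1 WHERE id IN ({qmarks})", ids)
                con.commit()
        if not rows:
            break
        ship(rows)              # hand the batch to the transform/load stages in parallel

def run_parallel_extraction(db_path, ship, workers=4):
    threads = [threading.Thread(target=extract_worker, args=(db_path, ship)) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```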


2019 ◽  
Vol 8 (4) ◽  
pp. 167 ◽  
Author(s):  
Bartolomeo Ventura ◽  
Andrea Vianello ◽  
Daniel Frisinghelli ◽  
Mattia Rossi ◽  
Roberto Monsorno ◽  
...  

Collecting, analyzing, and sharing, in near real time, data acquired by heterogeneous sensors, such as traffic, air pollution, soil moisture, or weather sensors, is a great challenge. This paper describes the solution developed at Eurac Research to automatically upload data in near real time, adopting Open Geospatial Consortium (OGC) Sensor Web Enablement (SWE) standards to guarantee interoperability. We set up a methodology capable of ingesting heterogeneous datasets that automates observation uploading and sensor registration with minimal user interaction. This solution has been successfully tested and applied in the Long Term (Socio-)Ecological Research (LT(S)ER) Matsch-Mazia initiative, and the code is accessible under the CC BY 4.0 license.
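As a loose illustration of standards-based observation upload, the sketch below pushes one observation with a single HTTP POST using the OGC SensorThings API JSON encoding; note this is a simplification and a different OGC interface than the SWE/SOS services the authors adopted, and the endpoint, Datastream id, and values are placeholders.

```python
# Illustrative only: posts one observation to a SensorThings-style endpoint.
# The URL, Datastream id, and measured value are placeholders.
import json, urllib.request

observation = {
    "phenomenonTime": "2019-06-01T10:00:00Z",    # when the value was measured
    "result": 23.7,                              # e.g. soil moisture reading
    "Datastream": {"@iot.id": 42},               # links the value to its sensor's datastream
}
req = urllib.request.Request(
    "https://sensors.example.org/v1.1/Observations",
    data=json.dumps(observation).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)                           # 201 Created on success
```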


Electronics ◽  
2022 ◽  
Vol 11 (2) ◽  
pp. 213
Author(s):  
Ghada Abdelmoumin ◽  
Jessica Whitaker ◽  
Danda B. Rawat ◽  
Abdul Rahman

An effective anomaly-based intelligent IDS (AN-Intel-IDS) must detect both known and unknown attacks. Hence, there is a need to train AN-Intel-IDS using dynamically generated, real-time data in an adversarial setting. Unfortunately, the public datasets available to train AN-Intel-IDS are ineluctably static, unrealistic, and prone to obsolescence. Further, the need to protect private data and conceal sensitive data features has limited data sharing, thus encouraging the use of synthetic data for training predictive and intrusion detection models. However, synthetic data can be unrealistic and potentially biased. On the other hand, real-time data are realistic and current; however, they are inherently imbalanced due to the uneven distribution of anomalous and non-anomalous examples. In general, non-anomalous or normal examples are more frequent than anomalous or attack examples, leading to a skewed distribution. Imbalanced data are predominant in intrusion detection applications and can lead to inaccurate predictions and degraded performance. Furthermore, the lack of real-time data produces potentially biased models that are less effective in predicting unknown attacks. Therefore, training AN-Intel-IDS using imbalanced and adversarial learning is instrumental to its efficacy and high performance. This paper investigates imbalanced learning and adversarial learning for training AN-Intel-IDS using a qualitative study. It surveys and synthesizes generative-based data augmentation techniques for addressing the uneven data distribution and generative-based adversarial techniques for generating synthetic yet realistic data in an adversarial setting, using rapid review, structured reporting, and subgroup analysis.
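For context on the imbalance problem the survey addresses, the simplest non-generative baseline is random oversampling of the minority (attack) class; the sketch below shows only that baseline, whereas the generative techniques the survey covers would synthesize new, realistic examples instead of duplicating existing ones.

```python
# Minimal baseline for rebalancing an intrusion-detection training set:
# duplicate minority-class (attack) examples until the classes are even.
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])   # balanced index set
    rng.shuffle(idx)
    return X[idx], y[idx]
```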


2019 ◽  
Vol 2 (1) ◽  
pp. 71-82 ◽  
Author(s):  
Jiada Li ◽  
Shuangli Bao ◽  
Steven Burian

Recently, smart water applications have gained worldwide attention, but there is a lack of understanding of how to construct smart water networks. This is partly because of the limited investigation into how to combine physical experiments with model simulations. This study aimed to investigate the process of connecting a micro smart water test bed (MWTB) to a 'two-loop' EPANET hydraulic model, which involves experimental set-up, real-time data acquisition, hydraulic simulation, and system performance demonstration. In this study, an MWTB was established based on flow-sensing technology. The data generated by the MWTB were stored in an Observations Data Model (ODM) database for visualization in the RStudio environment and also archived as input to the EPANET hydraulic simulation. The data visualization fitted the operation scenarios of the MWTB well. Additionally, the degree of fit between the experimental measurements and the modeling outputs indicates that the 'two-loop' EPANET model can represent the operation of the MWTB, supporting a better understanding of the hydraulic analysis.
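A minimal sketch of the measurement-versus-simulation comparison step, assuming the wntr Python package as an EPANET front end and a CSV export of the ODM measurements (assumptions made for illustration; the study itself uses the ODM database, RStudio, and EPANET directly):

```python
# Illustrative comparison of sensed vs. simulated flows for a small network.
# 'two_loop.inp', 'odm_export.csv', and the pipe/column names are placeholders.
import pandas as pd
import wntr

wn = wntr.network.WaterNetworkModel("two_loop.inp")      # EPANET input file
sim = wntr.sim.EpanetSimulator(wn)
results = sim.run_sim()

simulated = results.link["flowrate"]["P1"]                # simulated flow in pipe P1
measured = pd.read_csv("odm_export.csv", index_col=0)["P1_flow"]  # sensor data exported from ODM

aligned = pd.concat([simulated, measured], axis=1, keys=["sim", "obs"]).dropna()
print("correlation:", aligned["sim"].corr(aligned["obs"]))
```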


2022 ◽  
pp. 1-22
Author(s):  
Salem Al-Gharbi ◽  
Abdulaziz Al-Majed ◽  
Abdulazeez Abdulraheem ◽  
Zeeshan Tariq ◽  
Mohamed Mahmoud

The age of easy oil is ending, and the industry has started drilling in remote, unconventional conditions. To help produce safer, faster, and more effective operations, the utilization of artificial intelligence and machine learning (AI/ML) has become essential. Unfortunately, due to the harsh drilling environments and the data-transmission setup, a significant amount of the real-time data can be defective. The quality and effectiveness of AI/ML models are directly related to the quality of the input data; the analytical and prediction models that AI/ML generates will only be good if the input data are good. Improving the real-time data is therefore critical to the drilling industry. The objective of this paper is to propose an automated approach using eight statistical data-quality improvement algorithms on real-time drilling data. These techniques are Kalman filtering, moving average, kernel regression, median filter, exponential smoothing, LOWESS, wavelet filtering, and polynomial. A dataset of more than 150,000 rows is fed into the algorithms, and their customizable parameters are calibrated to achieve the best improvement result. An evaluation methodology is developed based on the characteristics of real-time drilling data, and the strengths and weaknesses of each algorithm are highlighted. Based on the evaluation criteria, the best results were achieved using exponential smoothing, the median filter, and the moving average. Exponential smoothing and the median filter improved data quality by removing most of the invalid data points; the moving average removed more invalid data points but trimmed the data range.
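Three of the best-performing techniques in the study (moving average, median filter, exponential smoothing) reduce to one-liners on a drilling channel; a minimal sketch, assuming the data sit in a pandas DataFrame, with placeholder file and column names and illustrative window sizes that would be calibrated per channel as the paper does:

```python
# Smoothing a noisy real-time drilling channel three ways; window sizes and
# the smoothing factor are illustrative, not the paper's calibrated values.
import pandas as pd

df = pd.read_csv("drilling_log.csv")          # placeholder file; one row per sample
rop = df["rate_of_penetration"]               # placeholder column name

df["rop_moving_avg"] = rop.rolling(window=15, center=True, min_periods=1).mean()
df["rop_median"]     = rop.rolling(window=15, center=True, min_periods=1).median()
df["rop_exp_smooth"] = rop.ewm(alpha=0.2, adjust=False).mean()
```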


2020 ◽  
Vol 31 (1) ◽  
pp. 20-37 ◽  
Author(s):  
M. Asif Naeem ◽  
Erum Mehmood ◽  
M. G. Abbas Malik ◽  
Noreen Jamil

Streaming data join is a critical process in the field of near-real-time data warehousing. For this purpose, an adaptive semi-stream join algorithm called CACHEJOIN (Cache Join), focusing on non-uniform stream data, has been proposed in the literature. However, this algorithm cannot exploit memory and CPU resources optimally, and consequently its service rate remains suboptimal due to the sequential execution of its two phases, the stream-probing (SP) phase and the disk-probing (DP) phase. Building on the advantages of CACHEJOIN, this article presents two modifications of it. The first, called P-CACHEJOIN (Parallel Cache Join), enables the parallel processing of the two phases of CACHEJOIN. This increases the number of joined stream records and therefore improves throughput considerably. The second, called OP-CACHEJOIN (Optimized Parallel Cache Join), loads stored data into memory in parallel while the DP phase is executing. The article presents an empirical performance analysis of both approaches against the existing CACHEJOIN using a synthetic skewed dataset.
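The key change in P-CACHEJOIN is running the stream-probing and disk-probing phases concurrently rather than sequentially; the sketch below illustrates that structure only (the shared cache, queues, and `lookup_disk` helper are placeholders, not the authors' implementation):

```python
# Two workers share the incoming stream: frequent keys are served from an
# in-memory cache (SP phase) while the remainder are joined against the
# disk-based relation (DP phase) in parallel, instead of alternating phases.
import queue, threading

stream_q = queue.Queue()      # incoming stream tuples (key, payload)
cache = {}                    # hash table of frequent master-data rows (SP phase)
overflow_q = queue.Queue()    # tuples the cache cannot serve, passed to DP

def sp_phase(emit):
    while True:
        key, payload = stream_q.get()
        if key in cache:
            emit((key, payload, cache[key]))      # joined from memory
        else:
            overflow_q.put((key, payload))        # defer to the disk phase

def dp_phase(emit, lookup_disk):
    while True:
        key, payload = overflow_q.get()
        row = lookup_disk(key)                    # indexed read of the disk relation
        if row is not None:
            emit((key, payload, row))

def start(emit, lookup_disk):
    threading.Thread(target=sp_phase, args=(emit,), daemon=True).start()
    threading.Thread(target=dp_phase, args=(emit, lookup_disk), daemon=True).start()
```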


Diabetes ◽  
2020 ◽  
Vol 69 (Supplement 1) ◽  
pp. 399-P
Author(s):  
Ann Marie Hasse ◽  
Rifka Schulman ◽  
Tori Calder
