HYBRIDJOIN for Near-Real-Time Data Warehousing

2011 ◽  
Vol 7 (4) ◽  
pp. 21-42 ◽  
Author(s):  
M. Asif Naeem ◽  
Gillian Dobbie ◽  
Gerald Weber

An important component of near-real-time data warehouses is the near-real-time integration layer. One important element in near-real-time data integration is the join of a continuous input data stream with a disk-based relation. For high-throughput streams, stream-based algorithms such as Mesh Join (MESHJOIN) can be used. However, the performance of MESHJOIN is inversely proportional to the size of the disk-based relation. The Index Nested Loop Join (INLJ) can be set up to process stream input and can deal with interruptions in the update stream, but it has low throughput. This paper introduces a robust stream-based join algorithm called Hybrid Join (HYBRIDJOIN), which combines the two approaches. A theoretical result shows that HYBRIDJOIN is asymptotically as fast as the faster of the two algorithms. The authors present performance measurements of the implementation. In experiments using synthetic data based on a Zipfian distribution, HYBRIDJOIN performs significantly better for typical parameters of the Zipfian distribution and, in general, performs in accordance with the theoretical model, while the other two algorithms are unacceptably slow under various settings.
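The core trade-off is easiest to see in code. Below is a minimal, illustrative sketch of the hybrid idea in Python (not the authors' exact HYBRIDJOIN implementation): stream tuples are queued, each disk access is an index lookup driven by the oldest queued tuple, and the fetched partition is probed against the whole queue so a single disk read is amortized over many stream tuples. The `index_lookup` helper and the tuple layout are assumptions made for illustration.

```python
from collections import deque

def hybrid_join(stream, index_lookup, queue_limit=1000):
    """index_lookup(key) -> dict mapping join_key -> master-data row for the
    disk partition that contains `key`; a hypothetical helper standing in
    for an index access to the disk-based relation."""
    queue = deque()
    for tup in stream:                            # tup = (join_key, payload)
        queue.append(tup)
        if len(queue) < queue_limit:
            continue
        partition = index_lookup(queue[0][0])     # one disk read, driven by the oldest queued tuple
        pending = deque()
        for key, payload in queue:                # probe the whole queue against the loaded partition
            if key in partition:
                yield (key, payload, partition[key])
            else:
                pending.append((key, payload))    # unmatched tuples wait for a later partition
        queue = pending
```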


2021 ◽  
Author(s):  
Flavio de Assis Vilela ◽  
Ricardo Rodrigues Ciferri

ETL (Extract, Transform, and Load) is an essential process for data extraction in knowledge discovery in databases and in data warehousing environments. The ETL process gathers data available from operational sources, processes it, and stores it in an integrated data repository. The ETL process can also be performed in a real-time data warehousing environment, storing the data into a data warehouse. This paper presents a new method named Data Extraction Magnet (DEM) to perform the extraction phase of the ETL process in a real-time data warehousing environment, based on the concepts of non-intrusiveness, tagging, and parallelism. DEM has been validated in a dairy farming domain using synthetic data. The results showed a great performance gain over the traditional trigger technique and satisfaction of real-time requirements.
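The abstract does not detail DEM's internals; the following is a loose, hypothetical sketch of the general tag-and-parallelism idea (an assumption, not the authors' design): rows written by the operational system carry a tag column, and parallel extraction workers claim tagged batches without any triggers on the source table. The table, columns, and `ship()` callback are placeholders.

```python
# Hypothetical sketch of tag-based, non-intrusive parallel extraction: rows
# written by the operational system carry extracted = 0; worker threads claim
# tagged batches and ship them concurrently, with no triggers on the source table.
import sqlite3, threading

claim_lock = threading.Lock()   # serialize claiming; shipping still runs in parallel

def extract_worker(db_path, ship, batch=100):
    con = sqlite3.connect(db_path)
    while True:
        with claim_lock:        # claim a batch of tagged rows without double-shipping
            rows = con.execute(
                "SELECT id, payload FROM sales WHERE extracted = 0 LIMIT ?", (batch,)
            ).fetchall()
            if rows:
                ids = [r[0] for r in rows]
                qmarks = ",".join("?" * len(ids))
                con.execute(f"UPDATE sales SET extracted = 1 WHERE id IN ({qmarks})", ids)
                con.commit()
        if not rows:
            break
        ship(rows)              # hand the batch to the transform/load stages in parallel

def run_parallel_extraction(db_path, ship, workers=4):
    threads = [threading.Thread(target=extract_worker, args=(db_path, ship)) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```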


2019 ◽  
Vol 8 (4) ◽  
pp. 167 ◽  
Author(s):  
Bartolomeo Ventura ◽  
Andrea Vianello ◽  
Daniel Frisinghelli ◽  
Mattia Rossi ◽  
Roberto Monsorno ◽  
...  

Collecting, analyzing, and sharing, in near real time, data acquired by heterogeneous sensors, such as traffic, air pollution, soil moisture, or weather sensors, is a great challenge. This paper describes the solution developed at Eurac Research to automatically upload data in near real time, adopting Open Geospatial Consortium (OGC) Sensor Web Enablement (SWE) standards to guarantee interoperability. We set up a methodology capable of ingesting heterogeneous datasets that automates observation uploading and sensor registration with minimal user interaction. This solution has been successfully tested and applied in the Long Term (Socio-)Ecological Research (LT(S)ER) Matsch-Mazia initiative, and the code is accessible under the CC BY 4.0 license.
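As a loose illustration of standards-based observation upload, the sketch below pushes one observation with a single HTTP POST using the OGC SensorThings API JSON encoding; note this is a simplification and a different OGC interface than the SWE/SOS services the authors adopted, and the endpoint, Datastream id, and values are placeholders.

```python
# Illustrative only: posts one observation to a SensorThings-style endpoint.
# The URL, Datastream id, and measured value are placeholders.
import json, urllib.request

observation = {
    "phenomenonTime": "2019-06-01T10:00:00Z",    # when the value was measured
    "result": 23.7,                              # e.g. soil moisture reading
    "Datastream": {"@iot.id": 42},               # links the value to its sensor's datastream
}
req = urllib.request.Request(
    "https://sensors.example.org/v1.1/Observations",
    data=json.dumps(observation).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)                           # 201 Created on success
```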


Electronics ◽  
2022 ◽  
Vol 11 (2) ◽  
pp. 213
Author(s):  
Ghada Abdelmoumin ◽  
Jessica Whitaker ◽  
Danda B. Rawat ◽  
Abdul Rahman

An effective anomaly-based intelligent IDS (AN-Intel-IDS) must detect both known and unknown attacks. Hence, there is a need to train AN-Intel-IDS using dynamically generated, real-time data in an adversarial setting. Unfortunately, the public datasets available to train AN-Intel-IDS are ineluctably static, unrealistic, and prone to obsolescence. Further, the need to protect private data and conceal sensitive data features has limited data sharing, thus encouraging the use of synthetic data for training predictive and intrusion detection models. However, synthetic data can be unrealistic and potentially biased. On the other hand, real-time data are realistic and current; however, they are inherently imbalanced due to the uneven distribution of anomalous and non-anomalous examples. In general, non-anomalous or normal examples are more frequent than anomalous or attack examples, leading to a skewed distribution. Imbalanced data are predominant in intrusion detection applications and can lead to inaccurate predictions and degraded performance. Furthermore, the lack of real-time data produces potentially biased models that are less effective in predicting unknown attacks. Therefore, training AN-Intel-IDS using imbalanced and adversarial learning is instrumental to its efficacy and high performance. This paper investigates imbalanced learning and adversarial learning for training AN-Intel-IDS using a qualitative study. It surveys and synthesizes generative-based data augmentation techniques for addressing the uneven data distribution and generative-based adversarial techniques for generating synthetic yet realistic data in an adversarial setting, using rapid review, structured reporting, and subgroup analysis.
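For context on the imbalance problem the survey addresses, the simplest non-generative baseline is random oversampling of the minority (attack) class; the sketch below shows only that baseline, whereas the generative techniques the survey covers would synthesize new, realistic examples instead of duplicating existing ones.

```python
# Minimal baseline for rebalancing an intrusion-detection training set:
# duplicate minority-class (attack) examples until the classes are even.
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])   # balanced index set
    rng.shuffle(idx)
    return X[idx], y[idx]
```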


2019 ◽  
Vol 2 (1) ◽  
pp. 71-82 ◽  
Author(s):  
Jiada Li ◽  
Shuangli Bao ◽  
Steven Burian

Recently, smart water applications have gained worldwide attention, but there is a lack of understanding of how to construct smart water networks. This is partly because of the limited investigation into how to combine physical experiments with model simulations. This study aimed to investigate the process of connecting a micro smart water test bed (MWTB) to a 'two-loop' EPANET hydraulic model, which involves experimental set-up, real-time data acquisition, hydraulic simulation, and system performance demonstration. In this study, an MWTB was established based on flow-sensing technology. The data generated by the MWTB were stored in an Observations Data Model (ODM) database for visualization in the RStudio environment and also archived as input to the EPANET hydraulic simulation. The data visualization fitted the operation scenarios of the MWTB well. Additionally, the degree of fit between the experimental measurements and the modeling outputs indicates that the 'two-loop' EPANET model can represent the operation of the MWTB, supporting a better understanding of the hydraulic analysis.
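A minimal sketch of the measurement-versus-simulation comparison step, assuming the wntr Python package as an EPANET front end and a CSV export of the ODM measurements (assumptions made for illustration; the study itself uses the ODM database, RStudio, and EPANET directly):

```python
# Illustrative comparison of sensed vs. simulated flows for a small network.
# 'two_loop.inp', 'odm_export.csv', and the pipe/column names are placeholders.
import pandas as pd
import wntr

wn = wntr.network.WaterNetworkModel("two_loop.inp")      # EPANET input file
sim = wntr.sim.EpanetSimulator(wn)
results = sim.run_sim()

simulated = results.link["flowrate"]["P1"]                # simulated flow in pipe P1
measured = pd.read_csv("odm_export.csv", index_col=0)["P1_flow"]  # sensor data exported from ODM

aligned = pd.concat([simulated, measured], axis=1, keys=["sim", "obs"]).dropna()
print("correlation:", aligned["sim"].corr(aligned["obs"]))
```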


2022 ◽  
pp. 1-22
Author(s):  
Salem Al-Gharbi ◽  
Abdulaziz Al-Majed ◽  
Abdulazeez Abdulraheem ◽  
Zeeshan Tariq ◽  
Mohamed Mahmoud

The age of easy oil is ending, and the industry has started drilling in remote, unconventional conditions. To help produce safer, faster, and more effective operations, the utilization of artificial intelligence and machine learning (AI/ML) has become essential. Unfortunately, due to the harsh drilling environments and the data-transmission setup, a significant amount of the real-time data can be defective. The quality and effectiveness of AI/ML models are directly related to the quality of the input data; the analytical and prediction models that AI/ML generates will only be good if the input data are good. Improving the real-time data is therefore critical to the drilling industry. The objective of this paper is to propose an automated approach using eight statistical data-quality improvement algorithms on real-time drilling data. These techniques are Kalman filtering, moving average, kernel regression, median filter, exponential smoothing, LOWESS, wavelet filtering, and polynomial. A dataset of more than 150,000 rows is fed into the algorithms, and their customizable parameters are calibrated to achieve the best improvement result. An evaluation methodology is developed based on the characteristics of real-time drilling data, and the strengths and weaknesses of each algorithm are highlighted. Based on the evaluation criteria, the best results were achieved using exponential smoothing, the median filter, and the moving average. Exponential smoothing and the median filter improved data quality by removing most of the invalid data points; the moving average removed more invalid data points but trimmed the data range.
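Three of the best-performing techniques in the study (moving average, median filter, exponential smoothing) reduce to one-liners on a drilling channel; a minimal sketch, assuming the data sit in a pandas DataFrame, with placeholder file and column names and illustrative window sizes that would be calibrated per channel as the paper does:

```python
# Smoothing a noisy real-time drilling channel three ways; window sizes and
# the smoothing factor are illustrative, not the paper's calibrated values.
import pandas as pd

df = pd.read_csv("drilling_log.csv")          # placeholder file; one row per sample
rop = df["rate_of_penetration"]               # placeholder column name

df["rop_moving_avg"] = rop.rolling(window=15, center=True, min_periods=1).mean()
df["rop_median"]     = rop.rolling(window=15, center=True, min_periods=1).median()
df["rop_exp_smooth"] = rop.ewm(alpha=0.2, adjust=False).mean()
```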


2020 ◽  
Vol 31 (1) ◽  
pp. 20-37 ◽  
Author(s):  
M. Asif Naeem ◽  
Erum Mehmood ◽  
M. G. Abbas Malik ◽  
Noreen Jamil

Streaming data join is a critical process in the field of near-real-time data warehousing. For this purpose, an adaptive semi-stream join algorithm called CACHEJOIN (Cache Join), focusing on non-uniform stream data, has been proposed in the literature. However, this algorithm cannot exploit memory and CPU resources optimally, and consequently its service rate remains suboptimal due to the sequential execution of its two phases, the stream-probing (SP) phase and the disk-probing (DP) phase. Building on the advantages of CACHEJOIN, this article presents two modifications of it. The first, called P-CACHEJOIN (Parallel Cache Join), enables the parallel processing of the two phases of CACHEJOIN. This increases the number of joined stream records and therefore improves throughput considerably. The second, called OP-CACHEJOIN (Optimized Parallel Cache Join), loads stored data into memory in parallel while the DP phase is executing. The article presents an empirical performance analysis of both approaches against the existing CACHEJOIN using a synthetic skewed dataset.
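The key change in P-CACHEJOIN is running the stream-probing and disk-probing phases concurrently rather than sequentially; the sketch below illustrates that structure only (the shared cache, queues, and `lookup_disk` helper are placeholders, not the authors' implementation):

```python
# Two workers share the incoming stream: frequent keys are served from an
# in-memory cache (SP phase) while the remainder are joined against the
# disk-based relation (DP phase) in parallel, instead of alternating phases.
import queue, threading

stream_q = queue.Queue()      # incoming stream tuples (key, payload)
cache = {}                    # hash table of frequent master-data rows (SP phase)
overflow_q = queue.Queue()    # tuples the cache cannot serve, passed to DP

def sp_phase(emit):
    while True:
        key, payload = stream_q.get()
        if key in cache:
            emit((key, payload, cache[key]))      # joined from memory
        else:
            overflow_q.put((key, payload))        # defer to the disk phase

def dp_phase(emit, lookup_disk):
    while True:
        key, payload = overflow_q.get()
        row = lookup_disk(key)                    # indexed read of the disk relation
        if row is not None:
            emit((key, payload, row))

def start(emit, lookup_disk):
    threading.Thread(target=sp_phase, args=(emit,), daemon=True).start()
    threading.Thread(target=dp_phase, args=(emit, lookup_disk), daemon=True).start()
```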


Diabetes ◽  
2020 ◽  
Vol 69 (Supplement 1) ◽  
pp. 399-P
Author(s):  
Ann Marie Hasse ◽  
Rifka Schulman ◽  
Tori Calder
