An Innovative Method to Extract Data in a Real-time Data Warehousing Environment

2021 ◽  
Author(s):  
Flavio de Assis Vilela ◽  
Ricardo Rodrigues Ciferri

ETL (Extract, Transform, and Load) is an essential process for data extraction in knowledge discovery in databases and in data warehousing environments. The ETL process gathers data available from operational sources, processes it, and stores it in an integrated data repository. The ETL process can also be performed in a real-time data warehousing environment, storing data in a data warehouse. This paper presents a new and innovative method named Data Extraction Magnet (DEM) to perform the extraction phase of the ETL process in a real-time data warehousing environment, based on non-intrusive, tag, and parallelism concepts. DEM has been validated on a dairy farming domain using synthetic data. The results showed a large performance gain in comparison to the traditional trigger technique, and the real-time requirements were met.
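For orientation, the trigger technique used as the baseline in the comparison can be sketched as follows. This is a minimal illustration using SQLite, not the DEM method and not the authors' experimental setup; the table names `milk_yield` and `change_log`, the trigger name `capture_insert`, and the helper functions are hypothetical.

```python
import sqlite3

def build_source(conn):
    """Create a hypothetical operational table plus a trigger-fed change log."""
    conn.executescript("""
        CREATE TABLE milk_yield (
            cow_id INTEGER,
            litres REAL
        );
        CREATE TABLE change_log (
            seq    INTEGER PRIMARY KEY AUTOINCREMENT,
            cow_id INTEGER,
            litres REAL
        );
        -- The trigger fires as part of every operational INSERT,
        -- so the capture work is added to the source system itself.
        CREATE TRIGGER capture_insert AFTER INSERT ON milk_yield
        BEGIN
            INSERT INTO change_log (cow_id, litres)
            VALUES (NEW.cow_id, NEW.litres);
        END;
    """)

def extract_changes(conn, last_seq):
    """Return change rows newer than last_seq and the new high-water mark."""
    rows = conn.execute(
        "SELECT seq, cow_id, litres FROM change_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    new_seq = rows[-1][0] if rows else last_seq
    return rows, new_seq

conn = sqlite3.connect(":memory:")
build_source(conn)
conn.execute("INSERT INTO milk_yield VALUES (1, 12.5)")
conn.execute("INSERT INTO milk_yield VALUES (2, 9.0)")
changes, mark = extract_changes(conn, 0)
```

Calling `extract_changes(conn, mark)` again returns only rows captured after the previous call, which is how an extractor would avoid re-reading data it has already shipped.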

In standard ETL (Extract, Transform, Load), the data warehouse refresh must be performed outside of peak hours. This implies that operational activity and analysis are halted while the refresh runs. As a result, the data in the data warehouse does not reflect the latest operational transactions. This issue is known as data latency. Near real-time data warehousing is employed as a remedy for this issue. It updates the data warehouse in a near real-time fashion, immediately after data is found in the data source, so data latency can be reduced. However, near real-time data warehousing raises issues that were not identified in traditional ETL. This paper discusses the issues and the available alternatives at every stage of near real-time data warehousing. The issues and alternatives are based on a literature review of additional studies that focus on near real-time data warehousing issues.
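The effect of refresh frequency on data latency can be illustrated with a small calculation. This is a sketch under simplifying assumptions that are not taken from the paper: refreshes happen at fixed multiples of an interval, loading is instantaneous, and the event times (in minutes after midnight) are made up.

```python
import math

def load_time(event_time, refresh_interval):
    """Time of the first refresh at or after event_time."""
    return math.ceil(event_time / refresh_interval) * refresh_interval

def mean_latency(event_times, refresh_interval):
    """Average wait between an operational event and its arrival in the warehouse."""
    latencies = [load_time(t, refresh_interval) - t for t in event_times]
    return sum(latencies) / len(latencies)

events = [31, 402, 903, 1304]          # hypothetical event times, in minutes
nightly = mean_latency(events, 1440)   # one refresh per day
near_rt = mean_latency(events, 5)      # one refresh every five minutes
print(nightly, near_rt)                # prints 780.0 2.5
```

Under these assumptions, moving from a daily refresh to a five-minute refresh cuts the mean latency from 780 minutes to 2.5 minutes for the same events.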


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Sandeep Kumar Singh ◽  
Mamata Jenamani

Purpose: The purpose of this paper is to design a supply chain database schema for Cassandra to store real-time data generated by Radio Frequency IDentification technology in a traceability system. Design/methodology/approach: The real-time data generated in such traceability systems are of high frequency and volume, making them difficult to handle with traditional relational database technologies. To overcome this difficulty, a NoSQL database repository based on Cassandra is proposed. The efficacy of the proposed schema is compared with two such databases, document-based MongoDB and column family-based Cassandra, which are suitable for storing traceability data. Findings: The proposed Cassandra-based data repository outperforms the traditional Structured Query Language-based and MongoDB systems from the literature in terms of concurrent reading, and performs on par with them for writing and updating of tracing queries. Originality/value: The proposed schema is able to store the real-time data generated in a supply chain with low latency. To test the performance of the Cassandra-based data repository, a test-bed is designed in the lab and supply chain operations of the Indian Public Distribution System are simulated to generate data.

