data staging
Recently Published Documents

TOTAL DOCUMENTS: 49 (five years: 8)
H-INDEX: 7 (five years: 1)

2022 ◽ Vol 71 ◽ pp. 103200 ◽ Author(s): André Fonseca ◽ Camila Sardeto Deolindo ◽ Taisa Miranda ◽ Edgard Morya ◽ Edson Amaro Jr ◽ ...

Author(s): Ardhian Agung Yulianto

While a data warehouse is designed to support the decision-making function, the most time-consuming part of building one is the Extract, Transform, Load (ETL) process. In the case of an academic data warehouse whose data sources are the faculties' distributed databases, integration is not straightforward even though the databases share a typical schema. This paper presents the ETL process for a distributed-database academic data warehouse. Following the Data Flow Thread in the data staging area, a deep analysis is performed to identify all tables in each data source, including content profiling. The cleaning, conforming, and data delivery steps then pour the different data sources into the data warehouse (DW). Since the DW is developed using Kimball's bottom-up multidimensional approach, we identify three types of extraction activity on the source tables: merge, merge-union, and union. The cleaning and conforming steps are established by creating conformed dimensions based on data source analysis, refinement, and hierarchy structure. The final ETL step loads the data into integrated dimension and fact tables through the generation of surrogate keys. These processes run gradually over each distributed database source until everything is incorporated. The technical activities of this distributed-database ETL process can be adopted widely in other domains, provided the designer has advance knowledge of the structure and content of the data sources.
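As a rough illustration of the extraction, conforming, and surrogate-key steps outlined in this abstract, the sketch below unions the same table from two faculty databases, conforms it into a dimension, and generates surrogate keys for the fact load. It is a minimal Python/pandas example under assumed file, table, and column names (student_a.csv, nim, enrollment.csv), not the paper's actual implementation.

# Minimal sketch of the union-style extraction and surrogate-key loading
# described above. All file, table, and column names are hypothetical.
import pandas as pd

# Extract: union the same table from two distributed faculty databases.
src_a = pd.read_csv("student_a.csv")          # faculty A's student table
src_b = pd.read_csv("student_b.csv")          # faculty B's student table
students = pd.concat([src_a, src_b], ignore_index=True)

# Clean and conform: normalize text fields and drop duplicate natural keys.
students["name"] = students["name"].str.strip().str.title()
students = students.drop_duplicates(subset="nim")     # 'nim' = student number (assumed key)

# Deliver: generate a surrogate key for the conformed dimension table.
dim_student = students.reset_index(drop=True)
dim_student["student_sk"] = dim_student.index + 1     # simple sequential surrogate key

# A fact row then references the dimension through the surrogate key,
# looked up via the natural key during the fact-table load.
fact = pd.read_csv("enrollment.csv")                  # hypothetical fact source
fact = fact.merge(dim_student[["nim", "student_sk"]], on="nim", how="left")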


2019 ◽ Vol 90 ◽ pp. 102566 ◽ Author(s): Thaddeus Koehn ◽ Peter Athanas

Electronics ◽ 2019 ◽ Vol 8 (9) ◽ pp. 982 ◽ Author(s): Alberto Cascajo ◽ David E. Singh ◽ Jesus Carretero

This work presents an HPC framework that provides new strategies for resource management and job scheduling, based on executing different applications in shared compute nodes to maximize platform utilization. The framework includes a scalable monitoring tool that analyzes compute-node utilization across the platform. We also introduce an extension of CLARISSE, a middleware for data-staging coordination and control on large-scale HPC platforms, which uses the information provided by the monitor in combination with application-level analysis to detect performance degradation in the running applications. This degradation, caused by applications sharing compute nodes and competing for their resources, is avoided by means of dynamic application migration. A description of the architecture, together with a practical evaluation of the proposal, shows significant improvements of up to 20% in makespan and 10% in energy consumption compared with a non-optimized execution.
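The detect-degradation-then-migrate idea described in this abstract can be pictured with a small control loop: sample an application's progress rate, compare it against a baseline, and request a migration when the slowdown crosses a threshold. This is only an illustrative sketch; the job, monitor, and scheduler objects, the method names monitor.sample(), scheduler.find_underloaded_node(), scheduler.migrate(), and the 20% threshold are assumptions, not CLARISSE's or the framework's real API.

# Illustrative degradation-detection loop (assumed interfaces throughout).
import time

SLOWDOWN_THRESHOLD = 1.20   # assumed: act when the job runs >20% slower than baseline

def watch(job, monitor, scheduler, interval=10.0):
    baseline = monitor.sample(job)          # e.g. iterations/s measured without interference
    while job.running():
        time.sleep(interval)
        current = monitor.sample(job)
        # Degradation shows up as the ratio of baseline rate to current rate.
        if current > 0 and baseline / current > SLOWDOWN_THRESHOLD:
            target = scheduler.find_underloaded_node()
            if target is not None:
                scheduler.migrate(job, target)   # dynamic application migration
                baseline = monitor.sample(job)   # re-baseline after the move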


Author(s): Ardhian Agung Yulianto

While a data warehouse is designed to support the decision-making function, the most time-consuming part of building one is the Extract, Transform, Load (ETL) process. In the case of an academic data warehouse whose data sources are the faculties' distributed databases, integration is not straightforward even though the databases share a typical schema. This paper presents the detailed ETL process following the Data Flow Thread in the data staging area: identifying, profiling, and analyzing the content of all tables in the data sources, then cleaning, conforming the dimensions, and delivering the data to the data warehouse. These processes run gradually over each distributed database source until the data are merged. Dimension and fact tables are generated in a multidimensional model. The ETL tool is Pentaho Data Integration 6.1. ETL testing is done by comparing the data source with the data target, and DW testing is conducted by comparing data analyses between SQL queries and the Saiku Analytics plugin in the Pentaho Business Analytics Server.
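The source-versus-target ETL test mentioned here amounts to a reconciliation check. The sketch below compares row counts and per-column sums between a source table and the loaded warehouse table; it is an illustrative Python/sqlite3 stand-in (the paper performs the comparison with Pentaho and SQL queries), and the database, table, and column names are placeholders.

# Reconciliation-style ETL test: compare cheap fingerprints of source and target.
import sqlite3

def table_fingerprint(conn, table, columns):
    """Return (row_count, {column: sum}) as a cheap comparison fingerprint."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    rows = cur.fetchone()[0]
    sums = {}
    for col in columns:
        cur.execute(f"SELECT TOTAL({col}) FROM {table}")   # TOTAL() ignores NULLs
        sums[col] = cur.fetchone()[0]
    return rows, sums

source = sqlite3.connect("faculty_source.db")     # placeholder source database
target = sqlite3.connect("academic_dw.db")        # placeholder data warehouse

# The load is considered consistent if both fingerprints match.
assert table_fingerprint(source, "enrollment", ["credits"]) == \
       table_fingerprint(target, "fact_enrollment", ["credits"])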


2018 ◽ Vol 2 ◽ pp. e28093 ◽ Author(s): Lisa Palmer

How long does it take to digitize 11,000 film-based slides? Converting film to a raster graphic may take a relatively short time, but what is needed to prepare for the process, and once the images are digitized, what work is required to push the data out for public access? And how much does the entire conversion process cost? A case study of a rapid-capture digitization project at the Smithsonian Institution will be reviewed. In early 2016, the Smithsonian Institution National Museum of Natural History (NMNH) Division of Fishes acquired 10,559 film-based slides from world-renowned ichthyologist John (Jack) Randall. The first-generation slides contain images of the color patterns of hundreds of fish species, with locality information for each specimen written on the cardboard slide mount. When Jack began his photography in the 1960s, his images were at the forefront of color photography for fishes. He also collected specimens in remote island archipelagos in the Pacific and Indian Oceans, so many localities were, and continue to be, rare. The species represented on the slides are important to the scientific community, and the collection-event data written on the slide mounts make each image and its metadata an invaluable package of information. Upon receipt of Jack's significant donation, the Division of Fishes received multiple requests from ichthyologists for digital access to the slides. The Division of Fishes immediately implemented a plan to digitally capture the data. In many rapid-capture projects at the Smithsonian, the objects and specimens are digitized first and any associated data is transcribed at some later point. The Division approached this project differently in that the Randall collection was relatively small and Smithsonian staff, primarily interns, were available to transcribe data before image conversion. Post-production work included hiring two contractors to import images and associated metadata into NMNH's collections management system. This presentation will review our processes before, during, and after data conversion. Workflows include transcribing handwritten data, staging and digitizing film, importing data into the EMu client, and using redundancies to ensure data quality.

