A dataspace-based framework for OLAP analyses in a high-variety multistore

2021 ◽  
Author(s):  
Chiara Forresi ◽  
Enrico Gallinucci ◽  
Matteo Golfarelli ◽  
Hamdi Ben Hamadou

Abstract: The success of NoSQL DBMSs has pushed the adoption of polyglot storage systems that take advantage of the best characteristics of different technologies and data models. While operational applications benefit greatly from this choice, analytical applications suffer from the absence of schema consistency, not only between different DBMSs but also within a single NoSQL system. In this context, the discipline of data science is steering analysts away from traditional data warehousing and toward a more flexible and lightweight approach to data analysis. The idea is to perform OLAP analyses in a pay-as-you-go manner across heterogeneous schemas and data models, where the integration is progressively carried out by the user as the available data is explored. In this paper, we propose an approach to support data analysis within a high-variety multistore with heterogeneous schemas and overlapping records. Our approach supports the relational, document, wide-column, and key-value data models by automatically handling both data model and schema heterogeneity through a dataspace layer on top of the underlying DBMSs. The expressiveness we enable corresponds to GPSJ queries, the most common class of queries in OLAP applications. We rely on nested relational algebra to define a cross-database execution plan. The system has been prototyped on Apache Spark.
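
As a rough illustration of the query class, the sketch below runs a GPSJ-style query (selection, join, grouping, aggregation) in PySpark over two sources with different data models; the file paths and column names are hypothetical, and this is not the authors' prototype.

```python
# Minimal sketch (not the authors' prototype): a GPSJ-style query
# (generalized projection, selection, join) over two heterogeneous
# sources materialized in Spark. Paths and column names are
# hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gpsj-sketch").getOrCreate()

# Relational-style source (e.g., exported from an RDBMS).
orders = spark.read.option("header", True).csv("/data/orders.csv")

# Document-style source with nested fields (e.g., exported from a document store).
customers = spark.read.json("/data/customers.json")

# Selection, join, and generalized projection with aggregation (the GPSJ pattern).
result = (
    orders.filter(F.col("status") == "shipped")
          .join(customers, orders.customer_id == customers.id)
          .groupBy("country")                       # group-by attributes
          .agg(F.sum("amount").alias("revenue"))    # aggregate measure
)
result.show()
```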

2018 ◽  
Vol 37 (3) ◽  
pp. 29-49
Author(s):  
Kumar Sharma ◽  
Ujjal Marjit ◽  
Utpal Biswas

The Resource Description Framework (RDF) is a commonly used data model in the Semantic Web environment. Libraries and various other communities have been using the RDF data model to store valuable data after extracting it from traditional storage systems. However, because of the large volume of the data, processing and storing it is becoming increasingly difficult for traditional data-management tools. This challenge demands a scalable, distributed system that can manage data in parallel. In this article, a distributed solution is proposed for efficiently processing and storing the large volume of library linked data held in traditional storage systems. Apache Spark is used for parallel processing of large data sets, and a column-oriented schema is proposed for storing RDF data. The storage system is built on top of the Hadoop Distributed File System (HDFS) and uses the Apache Parquet format to store data in compressed form. The experimental evaluation showed that storage requirements were reduced significantly compared to Jena TDB, Sesame, RDF/XML, and N-Triples file formats. SPARQL queries are processed using Spark SQL to query the compressed data. The experimental evaluation showed good query response times, which decrease significantly as the number of worker nodes increases.
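
To make the storage idea concrete, here is a minimal PySpark sketch that writes RDF triples to compressed Parquet and answers a SPARQL-like pattern via Spark SQL; the plain subject/predicate/object layout and the HDFS path are illustrative stand-ins for the column-oriented schema proposed in the article.

```python
# Illustrative sketch only: storing RDF triples as compressed Parquet and
# querying them through Spark SQL. The article proposes a specific
# column-oriented schema; a plain (subject, predicate, object) layout
# stands in for it here, and the HDFS path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdf-parquet-sketch").getOrCreate()

triples = spark.createDataFrame(
    [("ex:book1", "dc:title", "Linked Data"),
     ("ex:book1", "dc:creator", "ex:author1"),
     ("ex:author1", "foaf:name", "Jane Doe")],
    ["subject", "predicate", "object"],
)

# Parquet is columnar and compressed by default (snappy).
triples.write.mode("overwrite").parquet("hdfs:///rdf/triples.parquet")

# A SPARQL basic graph pattern rewritten as a self-join in Spark SQL.
spark.read.parquet("hdfs:///rdf/triples.parquet").createOrReplaceTempView("t")
spark.sql("""
    SELECT t1.object AS title, t3.object AS author_name
    FROM t t1
    JOIN t t2 ON t1.subject = t2.subject AND t2.predicate = 'dc:creator'
    JOIN t t3 ON t2.object = t3.subject AND t3.predicate = 'foaf:name'
    WHERE t1.predicate = 'dc:title'
""").show()
```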


Author(s):  
Nenad Jukic ◽  
Svetlozar Nestorov ◽  
Susan V. Vrbsky ◽  
Allen Parrish

In this chapter, we extend the multi-level secure (MLS) data model to include non-key related cover stories, so that key attributes can have different values at different security levels. MLS data models require the classification of data and users into multiple security levels. In MLS systems, cover stories allow the information provided to users at lower security levels to differ from the information provided to users at higher security levels. Previous versions of the MLS model did not permit cover stories for key attributes, because the key is used to relate the various cover stories for a particular entity. We present the necessary model changes and the modifications to the relational algebra that are required to implement cover stories for keys. We demonstrate the improvements made by these changes, illustrate the increased expressiveness of the model, and determine the soundness of a database based on the described concepts.
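
The following toy Python sketch, which is not the chapter's formal model or algebra, illustrates the general idea of cover stories with level-dependent key values: the relation stores one tuple version per security level, and a user's view keeps, for each entity, the version at the highest level not exceeding their clearance.

```python
# Toy illustration (not the chapter's formal model): a multilevel relation
# in which each real-world entity can carry different tuple versions,
# including different visible key values, at different security levels.
LEVELS = {"unclassified": 0, "confidential": 1, "secret": 2}

# (internal entity id, security level, visible key, attribute value)
mls_relation = [
    ("e1", "secret",       "MissionAlpha", "destination=Mars"),
    ("e1", "confidential", "Training01",   "destination=Moon"),   # cover story with a different key
    ("e2", "unclassified", "Cargo17",      "destination=ISS"),
]

def view_at(clearance):
    # Keep, per entity, the tuple at the highest level not above the clearance.
    c = LEVELS[clearance]
    best = {}
    for entity, level, key, attr in mls_relation:
        if LEVELS[level] <= c:
            if entity not in best or LEVELS[level] > LEVELS[best[entity][1]]:
                best[entity] = (entity, level, key, attr)
    return list(best.values())

print(view_at("confidential"))  # sees the cover story 'Training01'
print(view_at("secret"))        # sees the real tuple 'MissionAlpha'
```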


Author(s):  
Markus Herrmann ◽  
Jörg Petzold ◽  
Vivek Bombatkar

A typical analytical lifecycle in data science projects starts with the process of data generation and collection, continues with data preparation and preprocessing, and heads towards project-specific analytics, visualizations, and presentations. In order to ensure high-quality, trusted analytics, every relevant step of the data-model-result linkage needs to meet certain quality standards, which in turn should be certified by trusted quality-gate mechanisms. We propose "blockchain-backed analytics", a scalable and easy-to-use generic approach to introducing quality gates into data science projects, backed by the immutable records of a blockchain. To that end, data, models, and results are stored as cryptographically hashed fingerprints in mutually linked transactions in a public blockchain database. This approach enables stakeholders of data science projects to track and trace the linkage of data, applied models, and modeling results without the need for trust validation by escrow systems or any other third party.
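
A minimal sketch of the fingerprinting step is shown below; the file names are assumptions, and the submission of the linked fingerprints as transactions to a public blockchain is omitted.

```python
# Minimal sketch of the fingerprinting idea (assumed local files: data.csv,
# model.pkl, results.json). The full approach would submit these hashes as
# mutually linked transactions to a public blockchain, which is omitted here.
import hashlib, json

def fingerprint(path):
    # Stream the file so large artifacts can be hashed without loading fully.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

data_hash = fingerprint("data.csv")
model_hash = fingerprint("model.pkl")

# The result record references the data and model fingerprints, so the
# data-model-result linkage can later be verified against the stored hashes.
result_record = {
    "data": data_hash,
    "model": model_hash,
    "results": fingerprint("results.json"),
}
record_hash = hashlib.sha256(
    json.dumps(result_record, sort_keys=True).encode()
).hexdigest()
print(record_hash)
```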


2021 ◽  
pp. 1-25
Author(s):  
Yu-Chin Hsu ◽  
Ji-Liang Shiu

Under a Mundlak-type correlated random effect (CRE) specification, we first show that the average likelihood of a parametric nonlinear panel data model is the convolution of the conditional distribution of the model and the distribution of the unobserved heterogeneity. Hence, the distribution of the unobserved heterogeneity can be recovered by means of a Fourier transformation without imposing a distributional assumption on the CRE specification. We subsequently construct a semiparametric family of average likelihood functions of observables by combining the conditional distribution of the model and the recovered distribution of the unobserved heterogeneity, and show that the parameters in the nonlinear panel data model and in the CRE specification are identifiable. Based on the identification result, we propose a sieve maximum likelihood estimator. Compared with the conventional parametric CRE approaches, the advantage of our method is that it is not subject to misspecification on the distribution of the CRE. Furthermore, we show that the average partial effects are identifiable and extend our results to dynamic nonlinear panel data models.
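
For intuition, the generic deconvolution identity behind such recovery arguments can be written as follows; the notation is illustrative and simplified, not the paper's.

```latex
% If the average likelihood g is the convolution of the model's conditional
% density f and the heterogeneity density h,
\[
  g(y) = (f * h)(y) = \int f(y - a)\, h(a)\, da ,
\]
% then the Fourier transform turns the convolution into a product,
\[
  \mathcal{F}[g](t) = \mathcal{F}[f](t)\,\mathcal{F}[h](t),
\]
% so h can be recovered by inverse transformation wherever
% \mathcal{F}[f](t) \neq 0:
\[
  h(a) = \frac{1}{2\pi}\int e^{-ita}\,
         \frac{\mathcal{F}[g](t)}{\mathcal{F}[f](t)}\, dt .
\]
```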


2021 ◽  
Author(s):  
Matthias Held ◽  
Grit Laudel ◽  
Jochen Gläser

Abstract: In this paper we utilize an opportunity to construct ground truths for topics in the field of atomic, molecular and optical physics. Our research questions in this paper focus on (i) how to construct a ground truth for topics and (ii) the suitability of common algorithms applied to bibliometric networks to reconstruct these topics. We use the ground truths to test two data models (direct citation and bibliographic coupling) with two algorithms (the Leiden algorithm and the Infomap algorithm). Our results are discomforting: none of the four combinations leads to a consistent reconstruction of the ground truths. No combination of data model and algorithm simultaneously reconstructs all micro-level topics at any resolution level. Meso-level topics are not reconstructed at all. This suggests (a) that we are currently unable to predict which combination of data model, algorithm and parameter setting will adequately reconstruct which (types of) topics, and (b) that a combination of several data models, algorithms and parameter settings appears to be necessary to reconstruct all or most topics in a set of papers.
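
As a small illustration of one of the tested data models, the sketch below builds a bibliographic coupling network from made-up reference lists and clusters it; networkx's greedy modularity routine stands in here for the Leiden and Infomap algorithms used in the study.

```python
# Sketch of the bibliographic coupling data model on toy data (paper ids and
# reference lists are invented). The study uses the Leiden and Infomap
# algorithms; greedy modularity communities from networkx are an easily
# available substitute for illustration only.
import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

references = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3"},
    "p3": {"r4", "r5"},
    "p4": {"r4", "r5", "r6"},
}

G = nx.Graph()
G.add_nodes_from(references)
for a, b in itertools.combinations(references, 2):
    shared = len(references[a] & references[b])   # coupling strength
    if shared:
        G.add_edge(a, b, weight=shared)

clusters = greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in clusters])   # candidate "topics" to compare with a ground truth
```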


2003 ◽  
Vol 12 (03) ◽  
pp. 325-363 ◽  
Author(s):  
Joseph Fong ◽  
Qing Li ◽  
Shi-Ming Huang

A data warehouse contains a vast amount of data to support the complex queries of various Decision Support Systems (DSSs). It needs to store materialized views of data, which must be available consistently and instantaneously. Using a frame metadata model, this paper presents an architecture for universal data warehousing with different data models. The frame metadata model represents the metadata of a data warehouse, structures an application domain into classes, and integrates the schemas of heterogeneous databases by capturing their semantics. A star schema is derived from user requirements based on the integrated schema and catalogued in the metadata, which stores the schemas of the relational database (RDB) and the object-oriented database (OODB). Data materialization between the RDB and the OODB is achieved by unloading the source database into a sequential file and reloading it into the target database; through this, an object-relational view can be defined so that users can obtain the same warehouse view in different data models simultaneously. We describe our procedures for building the relational view of the star schema by a multidimensional SQL query, and the object-oriented view of the data warehouse by Online Analytical Processing (OLAP) through method calls, both derived from the integrated schema. To validate our work, an application prototype system has been developed in a product sales data warehousing domain based on this approach.
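
As a toy illustration of the relational side, the following Python/SQLite sketch builds a small product-sales star schema with hypothetical table and column names and rolls it up with a multidimensional (group-by) SQL query.

```python
# Toy illustration of a relational star-schema view queried with a
# multidimensional (group-by) SQL query. Table and column names are
# hypothetical and not taken from the paper's prototype.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, time_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
    INSERT INTO dim_time    VALUES (1, 2002), (2, 2003);
    INSERT INTO fact_sales  VALUES (1, 1, 900.0), (1, 2, 1100.0), (2, 2, 400.0);
""")

# Roll up the fact table by product category and year over the star schema.
for row in con.execute("""
    SELECT p.category, t.year, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_time    t ON f.time_id    = t.time_id
    GROUP BY p.category, t.year
"""):
    print(row)
```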


2014 ◽  
Vol 635-637 ◽  
pp. 1948-1951
Author(s):  
Yao Guang Hu ◽  
Dong Feng Wu ◽  
Jing Qian Wen

Based on the business processes for electronic components and an analysis of the related quality data, a model built around the object entity of the product life cycle is proposed. With the object entity as the carrier of the related data, the model merges and reorganizes the related business processes and links entities through the information revolving around the quality data model, thereby achieving the integrity of the business in both time and space. Using this data model as a basis, the integration and sharing of quality data can be realized effectively, quality data analysis and quality traceability are facilitated, and the enterprise's quality data management capabilities are improved.
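
A rough sketch of the general idea, with invented field names, might look like this in Python: quality records from different life-cycle stages all reference the same object entity, which enables integration and traceability per product instance.

```python
# Rough sketch of the general idea only (field names are invented): quality
# records from different life-cycle stages attach to one object entity, so
# they can be integrated and traced back per product instance.
from dataclasses import dataclass, field
from typing import List

@dataclass
class QualityRecord:
    stage: str          # e.g. "incoming inspection", "assembly", "final test"
    metric: str
    value: float

@dataclass
class ObjectEntity:
    serial_number: str  # the carrier that links data across time and space
    records: List[QualityRecord] = field(default_factory=list)

    def trace(self, metric):
        # Quality traceability: follow one metric through the life cycle.
        return [(r.stage, r.value) for r in self.records if r.metric == metric]

board = ObjectEntity("PCB-0001")
board.records.append(QualityRecord("incoming inspection", "solder_defects", 0.0))
board.records.append(QualityRecord("final test", "solder_defects", 2.0))
print(board.trace("solder_defects"))
```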


Author(s):  
Ladjel Bellatreche ◽  
Carlos Ordonez ◽  
Dominique Méry ◽  
Matteo Golfarelli ◽  
El Hassan Abdelwahed

2021 ◽  
Author(s):  
Nikolai West ◽  
Jonas Gries ◽  
Carina Brockmeier ◽  
Jens C. Gobel ◽  
Jochen Deuse
