Array Databases: Latest Research Papers

Scientific data is mainly multidimensional in nature, which presents interesting opportunities for optimization when it is managed by array databases. However, when the data is sparse, an efficient implementation is still required. In this paper, we investigate the adoption of the PH-tree as an in-memory indexing structure for sparse data. We compare performance in data ingestion and in both range and point queries, using SAVIME as the multidimensional array DBMS. Our experiments, using a real weather dataset, highlight the challenges of providing fast data ingestion, as proposed by SAVIME, while at the same time efficiently answering multidimensional queries on sparse data.
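To make the two query types in the abstract above concrete, here is a minimal Python sketch of a sparse multidimensional store that answers point and range queries. It is only an illustration under stated assumptions: the `SparseArray` class and its method names are hypothetical, it is not the PH-tree and not SAVIME's API, and its range query is a plain linear scan, whereas a real index would prune the search space.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, ...]


class SparseArray:
    """Hypothetical sparse store: only non-empty cells are kept (not a PH-tree)."""

    def __init__(self, ndim: int) -> None:
        self.ndim = ndim
        self.cells: Dict[Coord, float] = {}

    def insert(self, coord: Coord, value: float) -> None:
        # Ingestion: a single hash-map write per non-empty cell.
        if len(coord) != self.ndim:
            raise ValueError("coordinate has the wrong number of dimensions")
        self.cells[coord] = value

    def point_query(self, coord: Coord) -> Optional[float]:
        # Point query: exact lookup of one cell; None means the cell is empty.
        return self.cells.get(coord)

    def range_query(self, low: Coord, high: Coord) -> List[Tuple[Coord, float]]:
        # Range query: every stored cell inside the inclusive box [low, high].
        # This scans all stored cells; a spatial index would avoid that.
        return sorted(
            (c, v)
            for c, v in self.cells.items()
            if all(lo <= x <= hi for lo, x, hi in zip(low, c, high))
        )


# Example with made-up (time, latitude index, longitude index) cells.
arr = SparseArray(ndim=3)
arr.insert((0, 10, 5), 21.5)
arr.insert((3, 12, 5), 19.0)
arr.insert((9, 40, 7), 25.2)
print(arr.point_query((3, 12, 5)))
print(arr.range_query((0, 0, 0), (5, 20, 5)))
```

The sketch shows the trade-off the abstract points to: a structure that makes ingestion cheap (here, one dictionary write) does not by itself make multidimensional range queries cheap.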


Subarray Skyline Query Processing in Array Databases

33rd International Conference on Scientific and Statistical Database Management, 2021, DOI: 10.1145/3468791.3468799

Author(s): Dalsu Choi, Hyunsik Yoon, Yon Dohn Chung

Keyword(s): Query Processing, Skyline Query, Skyline Query Processing, Array Databases


Towards AI in Array Databases

2021, DOI: 10.5194/egusphere-egu21-7409

Author(s): Otoniel José Campos Escobar, Peter Baumann

Keyword(s): Machine Learning, Prediction Models, Query Language, Data Retrieval, Detailed Comparison, Sensor Data, Valuable Insight, Server Side, Rainfall Prediction, Array Databases

Multi-dimensional arrays (also known as raster data, gridded data, or datacubes) are key, if not essential, in many science and engineering domains. In the Earth sciences, a significant share of the data produced is array data, and the daily volume is huge, which makes it hard for researchers to analyze it and extract valuable insight. 1-D sensor data, 2-D satellite imagery, 3-D x/y/t image time series and x/y/z subsurface voxel data, and 4-D x/y/z/t atmospheric and ocean data often amount to dozens of terabytes every day, and the rate is only expected to increase. In response, Array Database systems were specifically designed and built to provide modeling, storage, and processing support for multi-dimensional arrays. They offer a declarative query language for flexible data retrieval, and some, e.g., rasdaman, provide federation processing and standards-based query capabilities compliant with OGC standards such as WCS, WCPS, and WMS. However, despite these advances, the gap between efficient information retrieval and the actual application of this data remains very broad, especially in the domains of artificial intelligence (AI) and machine learning (ML).

In this contribution, we present the state of the art in performing ML through Array Databases. First, a motivating example is introduced from the Deep Rain project, which aims at enhancing rainfall prediction accuracy in mountainous areas by implementing ML code on top of an Array Database. Deep Rain also explores novel methods for training prediction models by implementing server-side ML processing inside the database. A brief introduction to rasdaman, the Array Database used in this project, is also provided, featuring its standards-based query capabilities and the scalable federation processing features required for rainfall data processing. Next, the workflow approach to ML and Array Databases employed in the Deep Rain project is described in detail, listing the benefits of using an Array Database with declarative query language capabilities in the machine learning pipeline. A concrete use case illustrates step by step how these tools integrate. Then an alternative approach is presented in which ML is done inside the Array Database using user-defined functions (UDFs). Finally, a detailed comparison of the UDF and workflow approaches is presented, explaining their challenges and benefits.


Array databases: concepts, standards, implementations

Journal of Big Data, Vol 8 (1), 2021, DOI: 10.1186/s40537-020-00399-2

Author(s): Peter Baumann, Dimitar Misev, Vlad Merticariu, Bang Pham Huu

Keyword(s): Service Quality, Ad Hoc, Query Language, Distributed Processing, Database Systems, Database Technology, Comprehensive Survey, Spatio Temporal, And Performance, Array Databases

Abstract: Multi-dimensional arrays (also known as raster data or gridded data) play a key role in many, if not all, science and engineering domains, where they typically represent spatio-temporal sensor, image, simulation output, or statistics “datacubes”. As classic database technology does not support arrays adequately, such data today are maintained mostly in silo solutions, with architectures that tend to erode and fail to keep up with increasing requirements on performance and service quality. Array Database systems attempt to close this gap by providing declarative query support for flexible ad-hoc analytics on large n-D arrays, similar to what SQL offers on set-oriented data, XQuery on hierarchical data, and SPARQL and Cypher on graph data. Today, petascale Array Database installations exist, employing massive parallelism and distributed processing. Hence, questions arise about the technology and standards available, usability, and overall maturity. Several papers have compared models and formalisms, and benchmarks have been undertaken as well, typically comparing two systems against each other. While each of these represents valuable research, to the best of our knowledge there is no comprehensive survey combining model, query language, architecture, practical usability, and performance aspects. The size of this comparison also differentiates our study: 19 systems are compared and four are benchmarked, to an extent and depth clearly exceeding previous papers in the field; for example, subsetting tests were designed so that systems cannot be tuned specifically to these queries. It is hoped that this gives a representative overview to all who want to immerse themselves in the field, as well as clear guidance to those who need to choose the best-suited datacube tool for their application. This article presents results of the Research Data Alliance (RDA) Array Database Assessment Working Group (ADA:WG), a subgroup of the Big Data Interest Group. It has elicited the state of the art in Array Databases, technically supported by IEEE GRSS and CODATA Germany, to answer the question: how can data scientists and engineers benefit from Array Database technology? As it turns out, Array Databases can offer significant advantages in terms of flexibility, functionality, extensibility, as well as performance and scalability. In total, the database approach of offering “datacubes” analysis-ready heralds a new level of service quality. Investigation shows that there is a lively ecosystem of technology with increasing uptake, and proven array analytics standards are in place. Consequently, such approaches have to be considered a serious option for datacube services in science, engineering, and beyond. Tools, though, turn out to vary greatly in functionality and performance.


On the Integration of Machine Learning and Array Databases

2020 IEEE 36th International Conference on Data Engineering (ICDE), 2020, DOI: 10.1109/icde48307.2020.00170

Author(s): Sebastian Villarroya, Peter Baumann

Keyword(s): Machine Learning, Array Databases


United in Variety: The EarthServer Datacube Federation

2020, DOI: 10.5194/egusphere-egu2020-10849

Author(s): Peter Baumann

Keyword(s): Data Center, Data Centers, Distributed Data, Temporal Data, Atmospheric Data, Elevation Data, Technology Services, Spatio Temporal, Data Location, Array Databases

Datacubes form an accepted cornerstone for analysis-ready (and visualization-ready) spatio-temporal data offerings. Beyond the multi-dimensional data structure, the paradigm also suggests rich services that abstract away from the intractable zillions of files and products: actionable datacubes as established by Array Databases enable users to ask "any query, any time" without programming. The principle of location-transparent federations establishes a single, coherent information space.

The EarthServer federation is a large, growing data center network offering Petabytes of a critical variety, such as radar and optical satellite data, atmospheric data, elevation data, and thematic cubes like global sea ice. Around CODE-DE and DIASs, an ecosystem of data has been established that is available to users as a single pool, in particular for efficient distributed data fusion irrespective of data location.

In our talk we present the technology, services, and governance of this unique intercontinental line-up of data centers. A live demo will show distributed datacube fusion.
