Maritime Data Processing in Relational Databases

2021 ◽  
pp. 73-118
Author(s):  
Laurent Etienne ◽  
Cyril Ray ◽  
Elena Camossi ◽  
Clément Iphar
2018 ◽  
Vol 10 (3) ◽  
pp. 76-90
Author(s):  
Ye Tao ◽  
Xiaodong Wang ◽  
Xiaowei Xu

This article describes how rapidly growing data volumes require systems that can handle massive, heterogeneous, unstructured data sets. However, most existing mature transaction processing systems are built on relational databases with structured data. The authors design a hybrid development framework that offers greater scalability and flexibility for data analysis and reporting while retaining maximum compatibility with the legacy platforms on which transaction business logic runs. Data, service, and user interfaces are implemented as a toolset stack for developing applications with information retrieval, data processing, analysis, and visualization functionality. A use case of healthcare data integration, in which information is collected and aggregated from diverse sources, is presented as an example. The workflow and a simulation of data processing and visualization are also discussed to validate the effectiveness of the proposed framework.
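As a rough illustration of the kind of hybrid stack described above (not the authors' actual framework), the sketch below merges structured records from a relational store with semi-structured JSON documents into a single patient summary. The table, field names, and the merge_patient_view helper are hypothetical and chosen only for illustration.

```python
import json
import sqlite3

# Structured side: a legacy relational table of patient admissions (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (patient_id TEXT, admitted DATE, diagnosis TEXT)")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?)",
    [("p1", "2017-03-02", "hypertension"), ("p1", "2017-09-14", "diabetes")],
)

# Unstructured side: heterogeneous JSON documents, e.g. exported from wearables or lab systems.
documents = [
    json.loads('{"patient_id": "p1", "source": "wearable", "heart_rate": 72}'),
    json.loads('{"patient_id": "p1", "source": "lab", "hba1c": 7.1}'),
]

def merge_patient_view(patient_id):
    """Aggregate relational rows and JSON documents into one unified record."""
    rows = conn.execute(
        "SELECT admitted, diagnosis FROM admissions WHERE patient_id = ?", (patient_id,)
    ).fetchall()
    docs = [d for d in documents if d.get("patient_id") == patient_id]
    return {"patient_id": patient_id, "admissions": rows, "observations": docs}

print(merge_patient_view("p1"))
```

Keeping the relational side untouched mirrors the compatibility goal: legacy transaction logic keeps querying the same tables, while the aggregation layer adds flexibility on top.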


Author(s):  
Dietmar Wolfram

Many informetric phenomena lend themselves to ready adaptation to relational DBMS environments. SQL, the standard language used for constructing and querying relational databases, provides useful tools for processing informetric data. The author demonstrates the applications, and some limitations, of SQL for the efficient organization and tabulation of raw informetric data.
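A minimal sketch of the kind of SQL tabulation described above, using Python's built-in sqlite3 module; the papers table and its contents are invented for illustration. The query counts papers per author and then tabulates how many authors have each productivity level, a typical informetric distribution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (author TEXT, title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [
        ("Smith", "Paper A", 2016),
        ("Smith", "Paper B", 2017),
        ("Jones", "Paper C", 2017),
        ("Lee", "Paper D", 2018),
        ("Lee", "Paper E", 2018),
        ("Lee", "Paper F", 2019),
    ],
)

# Tabulate the author productivity distribution: how many authors wrote n papers.
query = """
SELECT n_papers, COUNT(*) AS n_authors
FROM (SELECT author, COUNT(*) AS n_papers FROM papers GROUP BY author)
GROUP BY n_papers
ORDER BY n_papers
"""
for n_papers, n_authors in conn.execute(query):
    print(f"{n_authors} author(s) with {n_papers} paper(s)")
```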


2018 ◽  
Vol 3 (5) ◽  
pp. 71-75
Author(s):  
Mária Princz

Database management using relational databases is part of the curriculum in Hungarian high schools. The aim of this paper is to present how we can show students the challenges of data processing and data retrieval that lie beyond the relational database management taught in high school.


Author(s):  
Yi-Cheng Tu ◽  
Gang Ding

Database administration (tuning) is the process of adjusting database configurations in order to accomplish desirable performance goals. This job is performed by human operators called database administrators (DBAs), who are generally well paid and are becoming more and more expensive with the increasing complexity and scale of modern databases. Considerable effort has been dedicated to reducing this cost (which often dominates the total ownership cost of mission-critical databases) by making database tuning more automated and transparent to users (Chaudhuri et al., 2004; Chaudhuri and Weikum, 2006). Research in this area seeks ways to automate hardware deployment, physical database design, parameter configuration, and resource management in such systems. The goal is to achieve acceptable performance at the whole-system level without (or with limited) human intervention. According to Weikum et al. (2002), problems in this category can be stated as

workload × configuration → performance,

which means that, given the features of the incoming workload to the database, we are to find the right settings for all system knobs such that the performance goals are satisfied. The following two are representative of a series of such tuning problems in different databases (a sketch of the second appears after this list):

• Problem 1: Maintenance of multi-class service-level agreements (SLAs) in relational databases. Database service providers usually offer various levels of performance guarantees to requests from different groups of customers. Fulfillment of such guarantees (SLAs) is accomplished by allocating different amounts of system resources to different queries. For example, query response time is negatively related to the amount of memory buffer assigned to that query. We need to dynamically allocate memory to individual queries such that the absolute or relative response times of queries from different users are satisfied.

• Problem 2: Load shedding in stream databases. Stream databases are used for processing data generated continuously from sources such as a sensor network. In streaming databases, data processing delay, i.e., the time consumed to process a data point, is the most critical performance metric (Tatbul et al., 2003). The ability to remain within a desired level of delay is significantly hampered under overload (caused by bursty data arrivals and time-varying unit data processing cost). When overloaded, some data is discarded (i.e., load shedding) in order to keep pace with the incoming load. The system needs to continuously adjust the amount of data discarded such that 1) delay is maintained under a desirable level, and 2) data is not discarded unnecessarily.
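Problem 2 can be pictured with a very small feedback loop: observe the current processing delay, compare it with the target, and adjust the fraction of tuples dropped. The sketch below is only a conceptual illustration of load shedding with a simple proportional controller, not one of the controllers proposed in the cited work; the delay model, gain, and target values are arbitrary.

```python
import random

TARGET_DELAY = 0.10   # desired per-tuple processing delay (seconds, arbitrary)
GAIN = 0.5            # proportional gain (arbitrary)
drop_fraction = 0.0   # fraction of incoming tuples to shed

def observe_delay(load, drop_fraction):
    """Toy model: delay grows with the load actually admitted into the system."""
    admitted = load * (1.0 - drop_fraction)
    return 0.02 + 0.002 * admitted

for step in range(20):
    load = random.uniform(20, 120)          # bursty arrivals (tuples per tick)
    delay = observe_delay(load, drop_fraction)
    error = delay - TARGET_DELAY
    # Shed more when delay exceeds the target, shed less when there is slack,
    # so data is not discarded unnecessarily.
    drop_fraction = min(1.0, max(0.0, drop_fraction + GAIN * error))
    print(f"step={step:2d} load={load:6.1f} delay={delay:.3f}s drop={drop_fraction:.2f}")
```

A real system would replace the toy delay model with measured delays and constrain which tuples may be shed.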


Author(s):  
Chun-Hsiung Tseng

Although keyword-based search algorithms usually do their jobs well, they sometimes yield odd results. Although the Web is the largest database, its set of search operations is still primitive compared to those of relational databases. This paper proposes two remedies. First, advanced information sources should be created; their role for the Web is analogous to that of views in relational databases. Second, several data processing tools based on the concept of advanced information sources are proposed. With these two mechanisms, the researcher tries to distinguish the data-centric view of the Web from its presentation view. Both the concept and an implementation are presented in the paper.
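To make the analogy with relational views concrete, here is a hedged sketch (not the paper's implementation) in which an HTML table embedded in a page is wrapped as a view-like information source supporting select and project operations over the underlying presentation-oriented markup. The sample HTML and column names are invented.

```python
from html.parser import HTMLParser

SAMPLE_PAGE = """
<table>
  <tr><td>Alice</td><td>Database Systems</td><td>2019</td></tr>
  <tr><td>Bob</td><td>Web Mining</td><td>2021</td></tr>
</table>
"""

class TableExtractor(HTMLParser):
    """Collect the cell text of each <tr> as one tuple."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(tuple(self._row))
    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

class WebView:
    """A view-like, data-centric abstraction over a presentation-oriented page."""
    def __init__(self, html, columns):
        parser = TableExtractor()
        parser.feed(html)
        self.columns = columns
        self.rows = parser.rows
    def select(self, predicate):
        return [r for r in self.rows if predicate(dict(zip(self.columns, r)))]
    def project(self, *cols):
        idx = [self.columns.index(c) for c in cols]
        return [tuple(r[i] for i in idx) for r in self.rows]

view = WebView(SAMPLE_PAGE, ["author", "topic", "year"])
print(view.select(lambda r: r["year"] > "2020"))
print(view.project("author", "topic"))
```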


2018 ◽  
Vol 2 ◽  
pp. 31 ◽  
Author(s):  
Greg Finak ◽  
Bryan Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large ’omics’ or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst’s usual work-flow. Ideally a reproducible research work-flow should fit naturally into an individual’s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R’s package system combined with a new tool DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented and performs checksum verification of these along with basic package version management, and importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
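DataPackageR itself is an R package built on R's package and vignette system. As a language-neutral illustration of the underlying idea only (decoupling processing from analysis while leaving a checksummed, traceable record), the Python sketch below packages a processed data set together with the hash of its raw input and the name of the processing step. All file names and metadata fields are hypothetical and do not correspond to DataPackageR's actual API.

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path):
    """Checksum used to verify that packaged data matches the recorded raw input."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def build_data_package(raw_path, process, out_dir="data_package"):
    """Run the processing step once and store its output with provenance metadata."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    processed = process(Path(raw_path).read_text())
    (out / "processed.json").write_text(json.dumps(processed, indent=2))
    metadata = {
        "raw_file": str(raw_path),
        "raw_sha256": file_sha256(raw_path),
        "processing_code": process.__name__,   # traceable record of the processing step
    }
    (out / "DATASHEET.json").write_text(json.dumps(metadata, indent=2))
    return out

def load_and_verify(package_dir, raw_path):
    """Analysts load the packaged data and check it still matches the raw input."""
    meta = json.loads((Path(package_dir) / "DATASHEET.json").read_text())
    assert meta["raw_sha256"] == file_sha256(raw_path), "raw data changed since packaging"
    return json.loads((Path(package_dir) / "processed.json").read_text())

# Example: a trivial processing step over a hypothetical raw CSV.
Path("raw_counts.csv").write_text("sample,count\nA,3\nB,5\n")
def summarize(raw_text):
    rows = [line.split(",") for line in raw_text.strip().splitlines()[1:]]
    return {"total": sum(int(c) for _, c in rows)}

pkg = build_data_package("raw_counts.csv", summarize)
print(load_and_verify(pkg, "raw_counts.csv"))
```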


2018 ◽  
Author(s):  
Greg Finak ◽  
Bryan T. Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large ‘omics’ or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high “startup” costs and poor adherence to procedures when they deviate substantially from an analyst’s usual work-flow. Ideally a reproducible research work-flow should fit naturally into an individual’s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R’s package system combined with a new tool DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented and performs checksum verification of these along with basic package version management, and importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.


2021 ◽  
Vol 14 (13) ◽  
pp. 3362-3375
Author(s):  
Remmelt Ammerlaan ◽  
Gilbert Antonius ◽  
Marc Friedman ◽  
H M Sajjad Hossain ◽  
Alekh Jindal ◽  
...  

Modern data processing systems require optimization at massive scale, and using machine learning to optimize these systems (ML-for-systems) has shown promising results. Unfortunately, ML-for-systems is subject to overgeneralizations that do not capture the large variety of workload patterns, and it tends to improve the performance of certain subsets of the workload while regressing performance for others. In this paper, we introduce a performance safeguard system, called PerfGuard, that designs pre-production experiments for deploying ML-for-systems. Instead of searching the entire space of query plans (a well-known, intractable problem), we focus on query plan deltas (a significantly smaller space). PerfGuard formalizes these differences and correlates plan deltas with important feedback signals, such as execution cost. We describe the deep learning architecture and the end-to-end pipeline in PerfGuard, which could be used with general relational databases. We show that this architecture improves on baseline models and that our pipeline identifies key query plan components as major contributors to plan disparity. Offline experimentation shows PerfGuard to be a promising approach, with many opportunities for future improvement.
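The notion of a query plan delta can be illustrated independently of PerfGuard's actual model: represent each plan as a multiset of operator signatures and take the difference, which is the much smaller object one would correlate with a change in execution cost. The plans, operators, and costs below are invented, and this is not the paper's feature encoding.

```python
from collections import Counter

def plan_signature(plan):
    """Represent a plan as a multiset of (operator, table-or-key) pairs."""
    return Counter((op, arg) for op, arg in plan)

def plan_delta(plan_a, plan_b):
    """Operators added or removed between two plans for the same query."""
    a, b = plan_signature(plan_a), plan_signature(plan_b)
    return {"added": b - a, "removed": a - b}

# Hypothetical plans for the same query before and after an optimizer change.
plan_v1 = [("Scan", "orders"), ("HashJoin", "orders.customer_id"), ("Scan", "customers")]
plan_v2 = [("IndexSeek", "orders"), ("NestedLoopJoin", "orders.customer_id"), ("Scan", "customers")]

delta = plan_delta(plan_v1, plan_v2)
cost_v1, cost_v2 = 120.0, 95.0            # invented execution costs
print(delta, "cost change:", cost_v2 - cost_v1)
```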


2020 ◽  
Vol 9 (5) ◽  
pp. 331
Author(s):  
Dongming Guo ◽  
Erling Onstein

Geospatial information has become indispensable for many application fields, including traffic planning, urban planning, and energy management. Geospatial data are mainly stored in relational databases that have been developed over several decades, and most geographic information applications are desktop applications. With the arrival of big data, geospatial information applications are also moving to, e.g., mobile platforms and Geospatial Web Services, which require more flexible data schemas, faster query response times, and more flexible scalability than traditional spatial relational databases currently offer. To respond to these new requirements, NoSQL (Not only SQL) databases are now being adopted for geospatial data storage, management, and queries. This paper reviews state-of-the-art geospatial data processing in the 10 most popular NoSQL databases. We summarize the supported geometry objects, main geometry functions, spatial indexes, query languages, and data formats of these 10 NoSQL databases. Moreover, the pros and cons of these NoSQL databases are analyzed in terms of geospatial data processing. A literature review and analysis showed that current document databases may be more suitable for massive geospatial data processing than other NoSQL databases, due to their comprehensive support for geometry objects and data formats, as well as their performance, geospatial functions, index methods, and academic development. However, depending on the application scenario, graph databases, key-value databases, and wide-column databases have their own advantages.
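As a concrete example of NoSQL geospatial querying, the sketch below uses pymongo against a document database (MongoDB) to index GeoJSON points with a 2dsphere index and run a proximity query. It assumes a MongoDB instance is reachable at the given URI; the database name, collection, and coordinates are invented.

```python
from pymongo import MongoClient, GEOSPHERE

# Assumes a MongoDB server is running locally (hypothetical URI and database name).
client = MongoClient("mongodb://localhost:27017")
places = client.geo_demo.places

# GeoJSON points require a 2dsphere index for spherical geometry queries.
places.create_index([("location", GEOSPHERE)])
places.insert_many([
    {"name": "Harbor", "location": {"type": "Point", "coordinates": [10.398, 63.436]}},
    {"name": "Station", "location": {"type": "Point", "coordinates": [10.392, 63.430]}},
])

# Find documents within 1 km of a query point (GeoJSON uses longitude, latitude order).
query = {
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [10.395, 63.433]},
            "$maxDistance": 1000,
        }
    }
}
for doc in places.find(query):
    print(doc["name"])
```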

