Parquet Cube to store and process gridded data

Author(s):  
Elisabeth Lambert ◽  
Jean-Michel Zigna ◽  
Thomas Zilio ◽  
Flavien Gouillon

The volume of data in the Earth observation domain is growing considerably, especially with the emergence of new generations of satellites providing far more precise measurements and, consequently, much larger data files. The 'big data' field offers solutions for storing and processing huge amounts of data. However, there is no established consensus, in either the industrial market or the open-source community, on big data solutions adapted to Earth observation data. The main difficulty is that these multi-dimensional data do not scale naturally. CNES and CLS, driven by a CLS business need, carried out a study to address this difficulty.

Two complementary use cases, located at different points in the value chain, were identified: 1) the development of an altimetry processing chain storing low-level altimetric measurements from multiple satellite missions, and 2) the extraction of oceanographic environmental data along animal and ship tracks. The original format of these environmental variables is netCDF. We first present the state of the art of big data technologies suited to this problem and their limitations. We then describe the prototypes behind both use cases, and in particular how the data are split into independent chunks that can be processed in parallel. The storage format chosen is Apache Parquet; in the first use case, the data are manipulated with the xarray library and all parallel processing is implemented with the Dask framework. An implementation using the Zarr library instead of Parquet has also been developed, and its results are shown as well. In the second use case, the enrichment of tracks with METOC (meteorological/oceanographic) data is developed using the Spark framework. Finally, we show results from this second use case, which runs operationally today for the extraction of oceanographic data along tracks. This second solution is an alternative to the Pangeo stack for industrial, Java-based development. It extends the traditional THREDDS subsetter, delivered by the open-source Unidata community, to a big data implementation. This Parquet storage and its associated service provide a smooth transition of gridded data into big data infrastructures.
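For readers unfamiliar with the stack mentioned above, the following minimal sketch illustrates the first use case's ingredients: a gridded netCDF variable is opened lazily with xarray, split into Dask chunks, flattened to a table, and written to partitioned Parquet, with the Zarr alternative kept multi-dimensional. The file names, the variable name "adt", and the chunk sizes are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch (not the authors' code): chunked netCDF -> Parquet with
# xarray and Dask. Paths, the variable name "adt" and chunk sizes are
# illustrative assumptions.
import xarray as xr

# Open the dataset lazily; the chunks argument splits the grid into
# independent pieces that Dask can process in parallel.
ds = xr.open_dataset(
    "sea_surface_height.nc",
    chunks={"time": 24, "latitude": 180, "longitude": 360},
)

# Parquet stores tables, so the multi-dimensional cube is flattened into
# a (time, latitude, longitude, value) Dask dataframe first.
df = ds["adt"].to_dataset().to_dask_dataframe()

# Each Dask partition becomes an independent Parquet file that can later
# be read and processed in parallel.
df.to_parquet("adt_cube.parquet", engine="pyarrow", write_index=False)

# The Zarr alternative mentioned above keeps the cube multi-dimensional
# instead of flattening it.
ds.to_zarr("adt_cube.zarr", mode="w")
```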

2014 ◽  
Vol 23 (01) ◽  
pp. 27-35 ◽  
Author(s):  
S. de Lusignan ◽  
S-T. Liaw ◽  
C. Kuziemsky ◽  
F. Mold ◽  
P. Krause ◽  
...  

Summary Background: Generally, the benefits and risks of vaccines can be determined from studies carried out as part of regulatory compliance, followed by surveillance of routine data; however, some rarer and longer-term events require new methods. Big data generated by increasingly affordable personalised computing and by pervasive computing devices are growing rapidly, and low-cost, high-volume cloud computing makes processing these data inexpensive. Objective: To describe how big data and related analytical methods might be applied to assess the benefits and risks of vaccines. Method: We reviewed the literature on the use of big data to improve health, applied to generic vaccine use cases that illustrate the benefits and risks of vaccination. We defined a use case as the interaction between a user and an information system to achieve a goal. We used flu vaccination and pre-school childhood immunisation as exemplars. Results: We reviewed three big data use cases relevant to assessing vaccine benefits and risks: (i) big data processing using crowd-sourcing, distributed big data processing, and predictive analytics; (ii) data integration from heterogeneous big data sources, e.g. the increasing range of devices in the “internet of things”; and (iii) real-time monitoring for the direct monitoring of epidemics as well as vaccine effects via social media and other data sources. Conclusions: Big data raises new ethical dilemmas, though its analysis methods can bring complementary real-time capabilities for monitoring epidemics and assessing vaccine benefit-risk balance.


2021 ◽  
Author(s):  
Thomas Jurczyk

This tutorial demonstrates how to apply clustering algorithms in Python through two concrete use cases. The first uses clustering to identify meaningful groups of Greco-Roman authors based on their publications and their reception. The second applies clustering algorithms to textual data in order to discover thematic groups. After finishing this tutorial, you will be able to use clustering in Python with scikit-learn on your own data, adding an invaluable method to your toolbox for exploratory data analysis.
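As a taste of the workflow the tutorial covers, here is a minimal, self-contained sketch of the second use case: short texts are vectorized with TF-IDF and grouped with K-Means. The documents and the number of clusters are placeholders, not the tutorial's dataset.

```python
# Minimal clustering sketch with scikit-learn; the documents and the
# number of clusters are placeholders, not the tutorial's data.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Homer's epics and their reception in antiquity",
    "Commentaries on Aristotle's natural philosophy",
    "Roman historiography and the works of Livy",
    "Stoic ethics in the letters of Seneca",
]

# Turn each document into a TF-IDF feature vector.
X = TfidfVectorizer(stop_words="english").fit_transform(documents)

# Group the documents into two thematic clusters.
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
print(list(zip(documents, labels)))
```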


2021 ◽  
Author(s):  
Karamarie Fecho ◽  
Stanley C Ahalt ◽  
Steven Appold ◽  
Saravanan Arunachalam ◽  
Emily Pfaff ◽  
...  

BACKGROUND The Integrated Clinical and Environmental Exposures Service (ICEES) serves as an open-source, disease-agnostic, regulatory-compliant framework and approach for openly exposing and exploring clinical data that have been integrated at the patient level with a variety of environmental exposures data. ICEES is equipped with tools to support basic statistical exploration of the integrated data in a completely open manner. OBJECTIVE This study aims to further develop and apply ICEES as a novel tool for openly exposing and exploring integrated clinical and environmental data. We focus on an asthma use case. METHODS We queried the ICEES open application programming interface using a functionality that supports Chi Square tests between feature variables and a primary outcome measure, with a Bonferroni correction for multiple comparisons (α=.001). We focused on two primary outcomes that are indicative of asthma exacerbations: annual emergency department (ED) or inpatient visits for respiratory issues; and annual prescriptions for prednisone. RESULTS Of the N = 157,410 patients within the asthma cohort, N = 26,332 patients (16.05%) had one or more annual emergency department or inpatient visits for respiratory issues, and N = 17,056 patients (10.40%) had one or more annual prescriptions for prednisone. We found that close proximity to a major roadway or highway, exposure to high levels of PM2.5 or ozone, female sex, Caucasian race, low residential density, lack of health insurance, and low household income were significantly associated with asthma exacerbations (P<.001). Asthma exacerbations did not vary by rural vs urban residence. Moreover, the results were largely consistent across outcome measures. CONCLUSIONS Our results demonstrate that ICEES can be used to replicate and extend published findings on factors that influence asthma exacerbations. As a disease-agnostic, open-source approach for integrating, exposing, and exploring patient-level clinical and environmental exposures data, we believe that ICEES will have broad adoption by other institutions and application in environmental health and other biomedical fields.
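The ICEES open API itself is not reproduced here; the sketch below only illustrates the statistical core of the analysis: a chi-square test of a binary feature against a binary outcome, judged at the Bonferroni-corrected threshold of α = .001. The contingency counts are invented placeholders, not ICEES data.

```python
# Illustrative chi-square test with a Bonferroni-corrected threshold;
# the contingency counts are invented placeholders, not ICEES data.
from scipy.stats import chi2_contingency

# Rows: exposed / not exposed to high PM2.5.
# Columns: asthma exacerbation yes / no.
table = [[480, 2520],
         [310, 3690]]

chi2, p, dof, expected = chi2_contingency(table)

alpha = 0.001  # Bonferroni-corrected significance level used in the study
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, significant = {p < alpha}")
```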


2021 ◽  
Vol 12 (1) ◽  
pp. 13
Author(s):  
Christoph Neuner ◽  
Roland Coras ◽  
Ingmar Blümcke ◽  
Alexander Popp ◽  
Sven M. Schlaffer ◽  
...  

Background: Processing whole-slide images (WSI) to train neural networks can be intricate and labor-intensive. We developed an open-source library that handles recurrent tasks in the processing of WSI and helps with the training and evaluation of neural networks for classification tasks. Methods: Two histopathology use cases were selected, and only hematoxylin and eosin (H&E) stained slides were used. The first use case was a two-class classification problem. We trained a convolutional neural network (CNN) to distinguish between dysembryoplastic neuroepithelial tumor (DNET) and ganglioglioma (GG), two neuropathological low-grade epilepsy-associated tumor entities. Within the second use case, we included four clinicopathological disease conditions in a multilabel approach. Here we trained a CNN to predict the hormone expression profile of pituitary adenomas. In the same approach, we also predicted clinically silent corticotroph adenoma. Results: Our DNET-GG classifier achieved an area under the curve (AUC) of 1.00 for the receiver operating characteristic (ROC) curve. For the second use case, the best-performing CNN achieved an AUC of 0.97 for corticotroph adenoma, 0.86 for silent corticotroph adenoma, and 0.98 for gonadotroph adenoma. All scores were calculated with the help of our library on predictions made on a case basis. Conclusions: Our comprehensive and fastai-compatible library helps standardize the workflow and minimize the burden of training a CNN. Indeed, our trained CNNs extracted neuropathologically relevant information from the WSI. This approach will supplement the clinicopathological diagnosis of brain tumors, which is currently based on cost-intensive microscopic examination and variable panels of immunohistochemical stains.
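The library itself is not named in the abstract, so the sketch below only shows the generic fastai workflow it is compatible with: training a CNN classifier on tiles pre-extracted from H&E-stained WSI. The folder layout, architecture, and number of epochs are assumptions, not the authors' settings.

```python
# Generic fastai training sketch (not the authors' library): a CNN
# classifier on WSI tiles sorted into per-class folders, e.g.
# tiles/DNET/*.png and tiles/GG/*.png. Paths and hyperparameters are
# illustrative assumptions.
from fastai.vision.all import *

dls = ImageDataLoaders.from_folder("tiles", valid_pct=0.2, bs=32)

learn = vision_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(5)

# Per-tile predictions would then be aggregated into per-case scores,
# as the case-based evaluation described above requires.
preds, targets = learn.get_preds()
```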


2021 ◽  
Vol 2 ◽  
Author(s):  
Janis Rosskamp ◽  
Hermann Meißenhelter ◽  
Rene Weller ◽  
Marc O. Rüdel ◽  
Johannes Ganser ◽  
...  

We present UnrealHaptics, a plugin architecture that enables advanced virtual reality (VR) interactions, such as haptics or grasping, in modern game engines. The core is a combination of a state-of-the-art collision detection library, with support for very fast and stable force and torque computations, and a general device plugin for communication with different input/output hardware devices, such as haptic devices or Cybergloves. Our modular and lightweight architecture makes it easy for other researchers to adapt our plugins to their requirements. We demonstrate the versatility of our plugin architecture with two use cases implemented in the Unreal Engine 4 (UE4). In the first use case, we tested our plugin with a haptic device in different test scenes. In the second use case, we show a virtual hand grasping an object with precise collision detection and handling of multiple contacts. We evaluated the performance in both use cases. The results show that our plugin easily meets the requirement of stable force rendering at 1 kHz for haptic rendering, even in highly non-convex scenes, and that it can handle the complex contact scenarios of virtual grasping.
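To make the 1 kHz requirement concrete, here is a short sketch (written in Python for brevity, not in UE4 C++) of the kind of haptic rendering loop such a plugin drives: a collision query yields a penetration depth, a penalty force is computed from it, and the result is pushed to the device roughly every millisecond. The stiffness value, collision query, and device call are stand-ins, not the UnrealHaptics API.

```python
# Illustrative haptic rendering loop (not the UnrealHaptics API): a penalty
# force derived from a collision query, updated at roughly 1 kHz.
import time

STIFFNESS = 800.0   # N/m, illustrative spring constant
RATE_HZ = 1000.0    # target haptic update rate

def penetration_depth(tool_position):
    """Stand-in for the collision-detection query (depth >= 0 on contact)."""
    return max(0.0, 0.01 - tool_position)

def penalty_force(depth):
    """Push back proportionally to the penetration depth."""
    return STIFFNESS * depth

tool_position = 0.005          # metres; would come from the device each tick
period = 1.0 / RATE_HZ
for _ in range(1000):          # one simulated second of haptic updates
    force = penalty_force(penetration_depth(tool_position))
    # send_force_to_device(force)  # stand-in for the device plugin call
    time.sleep(period)
```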


Author(s):  
Christos Katrakazas ◽  
Natalia Sobrino ◽  
Ilias Trochidis ◽  
Jose Manuel Vassallo ◽  
Stratos Arampatzis ◽  
...  

2017 ◽  
Vol 21 (3) ◽  
pp. 623-639 ◽  
Author(s):  
Tingting Zhang ◽  
William Yu Chung Wang ◽  
David J. Pauleen

Purpose This paper aims to investigate the value of big data investments by examining the market reaction to company announcements of big data investments and testing the effect for firms that are either knowledge-intensive or not. Design/methodology/approach This study is based on an event study using data from two stock markets in China. Findings When all the listed firms in the sample are grouped together, the stock market shows an overall increase in stock prices upon announcements of big data investments. Stock prices also increase for non-knowledge-intensive firms. However, the stock market does not seem to react to big data investment announcements when knowledge-intensive firms are tested alone. Research limitations/implications This study contributes to the literature on assessing the economic value of big data investments from the perspective of the big data information value chain by taking an unexpected change in stock price as the measure of the financial performance of the investment and by comparing market reactions between knowledge-intensive and non-knowledge-intensive firms. The findings can refine practitioners' understanding of the economic value of big data investments to different firms and guide their future investments in knowledge management to maximize the benefits along the big data information value chain. However, the findings should be interpreted carefully when applied to companies that are not publicly traded on the stock market or are listed on other financial markets. Originality/value Based on the concept of the big data information value chain, this study advances research on the economic value of big data investments. Taking the perspective of stock market investors, it investigates how the stock market reacts to big data investments by comparing the reactions for knowledge-intensive and non-knowledge-intensive firms. The results may be particularly interesting to publicly traded companies that have not previously invested in knowledge management systems. The findings imply that stock investors tend to believe that big data investment could increase future returns for non-knowledge-intensive firms.
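To illustrate the event-study logic described above, the sketch below estimates a simple market model over an estimation window and computes abnormal and cumulative abnormal returns over an event window. The return series are random placeholders, not the Chinese stock market data used in the study.

```python
# Illustrative event-study calculation; the return series are random
# placeholders, not the study's Chinese stock market data.
import numpy as np

rng = np.random.default_rng(0)
market = rng.normal(0.0005, 0.01, 120)                     # daily index returns
stock = 0.0002 + 1.1 * market + rng.normal(0, 0.01, 120)   # daily firm returns

# Estimation window: first 100 days; event window: last 20 days.
beta, alpha = np.polyfit(market[:100], stock[:100], 1)      # market-model fit

abnormal = stock[100:] - (alpha + beta * market[100:])      # abnormal returns
car = abnormal.sum()                                        # cumulative abnormal return
print(f"alpha = {alpha:.5f}, beta = {beta:.3f}, CAR = {car:.4f}")
```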


Electronics ◽  
2021 ◽  
Vol 10 (5) ◽  
pp. 592
Author(s):  
Radek Silhavy ◽  
Petr Silhavy ◽  
Zdenka Prokopova

Software size estimation is a complex, nontrivial task based on data analysis or on an algorithmic estimation approach, and it is important for software project planning and management. In this paper, a new method called Actors and Use Cases Size Estimation is proposed. The new method is based only on the number of actors and use cases. It is built on stepwise regression and leads to a very significant reduction in errors when estimating the size of software systems compared to Use Case Points-based methods. The proposed method is independent of Use Case Points, which eliminates the effect of inaccurately determined Use Case Points components, because such components are not used in the proposed method.
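The sketch below illustrates the core idea: regressing software size on the counts of actors and use cases. The data points are invented placeholders, and ordinary least squares stands in for the stepwise regression used in the paper.

```python
# Illustrative size-estimation regression; the data points are invented and
# ordinary least squares stands in for the paper's stepwise regression.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: [number of actors, number of use cases].
X = np.array([[3, 12], [5, 20], [8, 35], [4, 18], [10, 48], [6, 25]])
# Observed size of the corresponding systems (arbitrary size units).
y = np.array([310, 540, 930, 470, 1290, 660])

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)

# Size estimate for a new project with 7 actors and 30 use cases.
print("estimate:", model.predict([[7, 30]])[0])
```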


Geosciences ◽  
2021 ◽  
Vol 11 (2) ◽  
pp. 48
Author(s):  
Margaret F.J. Dolan ◽  
Rebecca E. Ross ◽  
Jon Albretsen ◽  
Jofrid Skarðhamar ◽  
Genoveva Gonzalez-Mirelis ◽  
...  

The use of habitat distribution models (HDMs) has become common in benthic habitat mapping for combining limited seabed observations with full-coverage environmental data to produce classified maps showing predicted habitat distribution for an entire study area. However, relatively few HDMs include oceanographic predictors, or present spatial validity or uncertainty analyses to support the classified predictions. Without reference studies it can be challenging to assess which type of oceanographic model data should be used, or developed, for this purpose. In this study, we compare biotope maps built using predictor variable suites from three different oceanographic models with differing levels of detail on near-bottom conditions. These results are compared with a baseline model without oceanographic predictors. We use associated spatial validity and uncertainty analyses to assess which oceanographic data may be best suited to biotope mapping. Our results show how spatial validity and uncertainty metrics capture differences between HDM outputs which are otherwise not apparent from standard non-spatial accuracy assessments or the classified maps themselves. We conclude that biotope HDMs incorporating high-resolution, preferably bottom-optimised, oceanography data can best minimise spatial uncertainty and maximise spatial validity. Furthermore, our results suggest that incorporating coarser oceanographic data may lead to more uncertainty than omitting such data.
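As a rough illustration of the comparison described above, the sketch below fits a habitat classifier with and without an oceanographic predictor and compares cross-validated accuracy. The data are synthetic placeholders, and an ordinary (non-spatial) cross-validation stands in for the spatial validity and uncertainty analyses used in the study.

```python
# Illustrative HDM comparison on synthetic data; ordinary cross-validation
# stands in for the study's spatial validity and uncertainty analyses.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 500
depth = rng.uniform(50, 400, n)                     # water depth (m)
slope = rng.uniform(0, 15, n)                       # seabed slope (degrees)
bottom_current = rng.uniform(0, 0.5, n)             # oceanographic predictor (m/s)
biotope = (depth < 200) & (bottom_current > 0.2)    # synthetic "observed" biotope

predictor_suites = {
    "terrain only": np.column_stack([depth, slope]),
    "with oceanography": np.column_stack([depth, slope, bottom_current]),
}

for name, X in predictor_suites.items():
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    score = cross_val_score(model, X, biotope, cv=5).mean()
    print(f"{name}: mean CV accuracy = {score:.3f}")
```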

