A Python-oriented environment for climate experiments at scale in the frame of the European Open Science Cloud

Author(s):  
Donatello Elia ◽  
Fabrizio Antonio ◽  
Cosimo Palazzo ◽  
Paola Nassisi ◽  
Sofiane Bendoukha ◽  
...  

Scientific data analysis experiments and applications require software capable of handling domain-specific and data-intensive workflows. The increasing volume of scientific data further exacerbates these data management and analytics challenges, pushing the community towards novel programming environments that deal efficiently with complex experiments while abstracting from the underlying computing infrastructure.

ECASLab provides a user-friendly data analytics environment to support scientists in their daily research activities, in particular in the climate change domain, by integrating analysis tools with scientific datasets (e.g., from the ESGF data archive) and computing resources (i.e., Cloud- and HPC-based). It combines the features of the ENES Climate Analytics Service (ECAS) and the JupyterHub service with a wide set of scientific libraries from the Python landscape for data manipulation, analysis and visualization. ECASLab is being set up in the frame of the European Open Science Cloud (EOSC) platform - in the EU H2020 EOSC-Hub project - by CMCC (https://ecaslab.cmcc.it/) and DKRZ (https://ecaslab.dkrz.de/), which host two major instances of the environment.

ECAS, which lies at the heart of ECASLab, enables scientists to perform data analysis experiments on large volumes of multi-dimensional data by providing a workflow-oriented, PID-supported, server-side and distributed computing approach. ECAS consists of multiple components centered around the Ophidia High Performance Data Analytics framework, which has been integrated with data access and sharing services (e.g., EUDAT B2DROP/B2SHARE, Onedata) and with the EGI federated cloud infrastructure. The integration with JupyterHub provides a convenient interface for scientists to access the ECAS features for developing and executing experiments, as well as for sharing results (and the experiment/workflow definitions themselves). ECAS parallel data analytics capabilities can be easily exploited in Jupyter Notebooks, by means of PyOphidia (the Ophidia Python bindings), together with well-known Python modules for processing and for plotting results on charts and maps (e.g., Dask, Xarray, NumPy, Matplotlib). ECAS is also one of the compute services made available to climate scientists by the EU H2020 IS-ENES3 project.

Hence, this integrated environment represents a complete software stack for the design and execution of interactive experiments as well as complex, data-intensive workflows. One class of such large-scale workflows, implemented efficiently on the environment's resources, is multi-model data analysis in the context of both CMIP5 and CMIP6 (i.e., precipitation trend analysis orchestrated in parallel over multiple CMIP-based datasets).
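As a rough illustration of the notebook workflow described above, the sketch below combines PyOphidia (the Ophidia Python bindings) with NumPy and Matplotlib to compute and plot a time-averaged field server-side. The dataset path, variable name and connection details are placeholders, and the exact PyOphidia signatures and returned structure may vary between releases; treat this as a sketch of the style of analysis, not ECASLab's canonical recipe.

```python
# Minimal ECAS-style notebook sketch, assuming a reachable Ophidia server and
# a NetCDF file exposing a precipitation variable 'pr'. Paths, credentials and
# variable names are illustrative placeholders.
import numpy as np
import matplotlib.pyplot as plt
from PyOphidia import cube

# Connect to the Ophidia server (credentials are read from the environment)
cube.Cube.setclient(read_env=True)

# Import the NetCDF file into an Ophidia datacube, with 'time' as the implicit
# (array) dimension so that reductions run server-side and in parallel
pr = cube.Cube.importnc(src_path='/public/data/pr_day_model_historical.nc',
                        measure='pr', imp_dim='time')

# Average over time on the server, then fetch the reduced values
pr_mean = pr.reduce(operation='avg')
result = pr_mean.export_array()   # dict layout may differ across releases

# Plot the reduced field with Matplotlib
values = np.array(result['measure'][0]['values']).flatten()
plt.plot(values)
plt.title('Time-averaged precipitation (illustrative)')
plt.xlabel('grid point index')
plt.ylabel('pr')
plt.show()
```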

2000 ◽  
Vol 09 (03) ◽  
pp. 293-297 ◽  
Author(s):  
D. BUSKULIC ◽  
L. DEROME ◽  
R. FLAMINIO ◽  
F. MARION ◽  
L. MASSONET ◽  
...  

A new generation of large-scale, complex Gravitational Wave detectors is being built. They will produce large amounts of data and will require intensive and specialized interactive/batch data analysis. We present VEGA, a framework for such data analysis based on ROOT. VEGA uses the Frame format adopted as a standard by gravitational wave groups around the world. Furthermore, new tools have been developed to facilitate data access and manipulation, as well as to interface with existing algorithms. VEGA is currently being evaluated by the VIRGO experiment.
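VEGA itself is a C++ framework built on ROOT, so a Python snippet cannot show its actual API; purely to give a flavour of the ROOT-style interactive analysis it targets, the hedged PyROOT sketch below histograms a synthetic strain-like series. The histogram and canvas classes are standard ROOT; the data are invented, and real gravitational wave samples would come from Frame files via the framework's I/O tools, which are not shown here.

```python
# Illustrative PyROOT sketch of ROOT-style interactive analysis; this is not
# VEGA's API, and the synthetic samples below merely stand in for detector
# strain data that would normally be read from Frame files.
import numpy as np
import ROOT

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1e-21, size=10000)   # fake "strain" samples

# Fill a ROOT histogram and draw it, as in an interactive session
hist = ROOT.TH1F("strain", "Synthetic strain distribution;strain;entries",
                 100, -5e-21, 5e-21)
for s in samples:
    hist.Fill(float(s))

canvas = ROOT.TCanvas("c1", "strain")
hist.Draw()
canvas.SaveAs("strain_hist.png")
```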


2016 ◽  
Vol 54 ◽  
pp. 456-468 ◽  
Author(s):  
Changjun Hu ◽  
Yang Li ◽  
Xin Cheng ◽  
Zhenyu Liu

2007 ◽  
Vol 15 (4) ◽  
pp. 249-268 ◽  
Author(s):  
Gurmeet Singh ◽  
Karan Vahi ◽  
Arun Ramakrishnan ◽  
Gaurang Mehta ◽  
Ewa Deelman ◽  
...  

In this paper we examine the problem of optimizing disk usage and scheduling large-scale scientific workflows onto distributed resources, where the workflows are data-intensive, requiring large amounts of data storage, and the resources have limited storage capacity. Our approach is two-fold: we minimize the amount of space a workflow requires during execution by removing data files at runtime when they are no longer needed, and we demonstrate that workflows may have to be restructured to reduce their overall data footprint. We show the results of our data management and workflow restructuring solutions using a Laser Interferometer Gravitational-Wave Observatory (LIGO) application and an astronomy application, Montage, running on a large-scale production grid, the Open Science Grid. We show that although a 48% reduction in the data footprint of Montage can be achieved with dynamic data cleanup techniques alone, LIGO Scientific Collaboration workflows require additional restructuring to achieve a 56% reduction in data space usage. We also examine the cost of the workflow restructuring in terms of the application's runtime.
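The dynamic cleanup idea can be sketched independently of any particular workflow system: given each task's input and output files, a file becomes removable as soon as its last consumer has finished. The toy Python scheduler below (with invented task and file names) tracks the remaining consumers of each file and reports when it can be deleted; it illustrates the footprint-reduction principle only and is not the authors' actual algorithm or the workflow system's implementation.

```python
# Toy illustration of dynamic data cleanup in a workflow: delete a file once
# its last consuming task has finished. Task and file names are invented.
from collections import defaultdict

# Each task lists the files it reads and the files it writes
tasks = {
    "extract":   {"inputs": [],          "outputs": ["raw.dat"]},
    "calibrate": {"inputs": ["raw.dat"], "outputs": ["cal.dat"]},
    "analyze":   {"inputs": ["cal.dat"], "outputs": ["result.dat"]},
    "plot":      {"inputs": ["cal.dat"], "outputs": ["plot.png"]},
}

# Count how many tasks still need each file
consumers = defaultdict(int)
for task in tasks.values():
    for f in task["inputs"]:
        consumers[f] += 1

def run_in_order(order):
    """Simulate executing tasks in a valid topological order, removing files
    as soon as no remaining task needs them."""
    live, peak = set(), 0
    for name in order:
        task = tasks[name]
        live.update(task["outputs"])
        peak = max(peak, len(live))
        for f in task["inputs"]:
            consumers[f] -= 1
            if consumers[f] == 0 and f in live:
                live.remove(f)   # a cleanup job would delete f here
                print(f"after {name}: {f} no longer needed, removed")
    print(f"peak number of live files: {peak}")

run_in_order(["extract", "calibrate", "analyze", "plot"])
```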


2019 ◽  
Vol 6 (1) ◽  
pp. 47-55 ◽  
Author(s):  
Alexandra Paxton ◽  
Alexa Tullett

Today, researchers can collect, analyze, and share more data than ever before. Not only does increasing technological capacity open the door to new data-intensive perspectives in cognitive science and psychology (i.e., research that takes advantage of complex or large-scale data to understand human cognition and behavior), but increasing connectedness has sparked exponential increases in the ease and practice of scientific transparency. The growing open science movement encourages researchers to share data, materials, methods, and publications with other scientists and the wider public. Open science benefits data-intensive psychological science, the public, and public policy, and we present recommendations to improve the adoption of open science practices by changing the academic incentive structure and by improving the education pipeline. Despite ongoing questions about implementing open science guidelines, policy makers have an unprecedented opportunity to shape the next frontier of scientific discovery.


2012 ◽  
pp. 299-306 ◽  
Author(s):  
J.A. Dieleman ◽  
J.J. Magan ◽  
A.M. Wubs ◽  
A. Palloix ◽  
S. Lenk ◽  
...  

Author(s):  
JÖRGEN BRANDT ◽  
WOLFGANG REISIG ◽  
ULF LESER

Cuneiform is a minimal functional programming language for large-scale scientific data analysis. Implementing a strict black-box view on external operators and data, it allows the direct embedding of code in a variety of external languages like Python or R, provides data-parallel higher-order operators for processing large partitioned data sets, allows conditionals and general recursion, and has a naturally parallelizable evaluation strategy suitable for multi-core servers and distributed execution environments like Hadoop, HTCondor, or distributed Erlang. Cuneiform has been applied in several data-intensive research areas, including remote sensing, machine learning, and bioinformatics, all of which critically depend on the flexible assembly of pre-existing tools and libraries written in different languages into complex pipelines. This paper introduces the computation semantics of Cuneiform. It presents Cuneiform's abstract syntax, a simple type system, and the semantics of evaluation. Providing an unambiguous specification of Cuneiform's behavior eases the implementation of interpreters, which we showcase by providing a concise reference implementation in Erlang. The similarity of Cuneiform's syntax to the simply typed lambda calculus puts Cuneiform in perspective and allows a straightforward discussion of its design in the context of functional programming. Moreover, the simple type system allows the deduction of the language's safety up to black-box operators. Last, the formulation of the semantics also permits the verification of compilers to and from other workflow languages.
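Cuneiform's own syntax and semantics are given in the paper; purely to illustrate two of the ideas the abstract emphasises, treating external code as a black box and mapping it in parallel over a partitioned data set, the Python sketch below wraps an opaque worker function and applies it concurrently to partitions. It mimics the flavour of a data-parallel higher-order operator and is not Cuneiform syntax or its Erlang reference implementation.

```python
# Illustration of a Cuneiform-style data-parallel higher-order operator: an
# opaque (black-box) task is mapped over partitions of a data set in parallel.
from concurrent.futures import ProcessPoolExecutor

def black_box_task(partition):
    """Stands in for externally defined code (e.g. a Python or R snippet
    embedded in a task); the runtime only sees its inputs and outputs."""
    return sum(x * x for x in partition)

def parallel_map(task, partitions, workers=4):
    """Data-parallel map of a task over a partitioned data set."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, partitions))

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk = 100_000
    partitions = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    partial_sums = parallel_map(black_box_task, partitions)
    print("sum of squares:", sum(partial_sums))
```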


2018 ◽  
Author(s):  
Alexandra Paxton ◽  
Alexa Mary Tullett

Today, researchers can collect, analyze, and share more data than ever before. Not only does increasing technological capacity open the door to new data-intensive perspectives in cognitive science and psychology (that is, research that takes advantage of complex or large-scale data to understand human cognition and behavior), but increasing connectedness has sparked exponential increases in the ease and practice of scientific transparency. The growing open science movement encourages researchers to share data, materials, methods, and publications with other scientists and the wider public. Open science benefits data-intensive psychological science, the public, and public policy, and we present recommendations to improve the adoption of open science practices by changing the academic incentive structure and by improving the education pipeline. Despite ongoing questions about implementing open-science guidelines, policymakers have an unprecedented opportunity to shape the next frontier of scientific discovery.


Author(s):  
Samik Banerjee ◽  
Lucas Magee ◽  
Dingkang Wang ◽  
Xu Li ◽  
Bingxing Huo ◽  
...  

Understanding neuronal circuitry at cellular resolution within the brain has relied on tract tracing methods, which involve careful observation and interpretation by experienced neuroscientists. With recent developments in imaging and digitization, this approach is no longer feasible for large-scale images (in the terabyte to petabyte range). Machine learning techniques based on deep networks provide an efficient alternative. However, these methods rely on very large volumes of annotated images for training and have error rates that are too high for scientific data analysis, and thus require a significant amount of human-in-the-loop proofreading. Here we introduce a hybrid architecture that combines prior structure, in the form of topological data analysis methods based on discrete Morse theory, with best-in-class deep-net architectures for neuronal connectivity analysis. We show significant performance gains using our hybrid architecture on the detection of topological structure (e.g., connectivity of neuronal processes and local intensity maxima on axons corresponding to synaptic swellings), with precision/recall close to 90% compared with human observers. We have adapted our architecture into a high-performance pipeline capable of semantic segmentation of light microscopic whole-brain image data into a hierarchy of neuronal compartments. We expect that the hybrid architecture incorporating discrete Morse techniques into deep nets will generalize to other data domains.
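The discrete-Morse machinery and the deep networks themselves are beyond a short snippet, but one concrete sub-task named in the abstract, detecting local intensity maxima along axons as candidate synaptic swellings, can be sketched with standard tools. The example below uses SciPy's maximum filter on a synthetic 2-D intensity image; it is a generic stand-in for illustration, not the authors' method.

```python
# Generic sketch of local-intensity-maxima detection (candidate synaptic
# swellings) on a synthetic 2-D image; the paper's discrete-Morse and deep-net
# components are not reproduced here.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(1)

# Synthetic image: low-amplitude smooth background plus a few bright blobs
yy, xx = np.mgrid[0:256, 0:256]
image = 0.1 * ndimage.gaussian_filter(rng.random((256, 256)), sigma=3)
for cy, cx in [(40, 60), (128, 200), (200, 90)]:
    image += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * 4.0 ** 2))

# A pixel is a candidate maximum if it equals the maximum of its 9x9
# neighbourhood and exceeds an absolute intensity threshold
neighbourhood_max = ndimage.maximum_filter(image, size=9)
maxima = (image == neighbourhood_max) & (image > 0.5)

coords = np.argwhere(maxima)
print(f"detected {len(coords)} candidate swellings at:")
print(coords)
```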

