A Python-oriented environment for climate experiments at scale in the frame of the European Open Science Cloud

Author(s):  
Donatello Elia ◽  
Fabrizio Antonio ◽  
Cosimo Palazzo ◽  
Paola Nassisi ◽  
Sofiane Bendoukha ◽  
...  

Scientific data analysis experiments and applications require software capable of handling domain-specific and data-intensive workflows. The increasing volume of scientific data further exacerbates these data management and analytics challenges, pushing the community towards novel programming environments that deal efficiently with complex experiments while abstracting from the underlying computing infrastructure.

ECASLab provides a user-friendly data analytics environment to support scientists in their daily research activities, in particular in the climate change domain, by integrating analysis tools with scientific datasets (e.g., from the ESGF data archive) and computing resources (i.e., Cloud- and HPC-based). It combines the features of the ENES Climate Analytics Service (ECAS) and the JupyterHub service with a wide set of scientific libraries from the Python landscape for data manipulation, analysis and visualization. ECASLab is being set up in the frame of the European Open Science Cloud (EOSC) platform - in the EU H2020 EOSC-Hub project - by CMCC (https://ecaslab.cmcc.it/) and DKRZ (https://ecaslab.dkrz.de/), which host two major instances of the environment.

ECAS, which lies at the heart of ECASLab, enables scientists to perform data analysis experiments on large volumes of multi-dimensional data by providing a workflow-oriented, PID-supported, server-side and distributed computing approach. ECAS consists of multiple components centered around the Ophidia High Performance Data Analytics framework, which has been integrated with data access and sharing services (e.g., EUDAT B2DROP/B2SHARE, Onedata) and with the EGI federated cloud infrastructure. The integration with JupyterHub provides a convenient interface for scientists to access the ECAS features for developing and executing experiments, as well as for sharing results (and the experiment/workflow definitions themselves). ECAS parallel data analytics capabilities can be easily exploited in Jupyter Notebooks, by means of PyOphidia (the Ophidia Python bindings), together with well-known Python modules for processing and for plotting results on charts and maps (e.g., Dask, Xarray, NumPy, Matplotlib). ECAS is also one of the compute services made available to climate scientists by the EU H2020 IS-ENES3 project.

Hence, this integrated environment represents a complete software stack for the design and execution of interactive experiments as well as complex, data-intensive workflows. One class of such large-scale workflows, implemented efficiently on the environment's resources, is multi-model data analysis in the context of both CMIP5 and CMIP6 (i.e., precipitation trend analysis orchestrated in parallel over multiple CMIP-based datasets).
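As a rough illustration of the notebook workflow described above, the sketch below combines PyOphidia (the Ophidia Python bindings) with NumPy and Matplotlib to compute and plot a time-averaged field server-side. The dataset path, variable name and connection details are placeholders, and the exact PyOphidia signatures and returned structure may vary between releases; treat this as a sketch of the style of analysis, not ECASLab's canonical recipe.

```python
# Minimal ECAS-style notebook sketch, assuming a reachable Ophidia server and
# a NetCDF file exposing a precipitation variable 'pr'. Paths, credentials and
# variable names are illustrative placeholders.
import numpy as np
import matplotlib.pyplot as plt
from PyOphidia import cube

# Connect to the Ophidia server (credentials are read from the environment)
cube.Cube.setclient(read_env=True)

# Import the NetCDF file into an Ophidia datacube, with 'time' as the implicit
# (array) dimension so that reductions run server-side and in parallel
pr = cube.Cube.importnc(src_path='/public/data/pr_day_model_historical.nc',
                        measure='pr', imp_dim='time')

# Average over time on the server, then fetch the reduced values
pr_mean = pr.reduce(operation='avg')
result = pr_mean.export_array()   # dict layout may differ across releases

# Plot the reduced field with Matplotlib
values = np.array(result['measure'][0]['values']).flatten()
plt.plot(values)
plt.title('Time-averaged precipitation (illustrative)')
plt.xlabel('grid point index')
plt.ylabel('pr')
plt.show()
```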

2000 ◽  
Vol 09 (03) ◽  
pp. 293-297 ◽  
Author(s):  
D. BUSKULIC ◽  
L. DEROME ◽  
R. FLAMINIO ◽  
F. MARION ◽  
L. MASSONET ◽  
...  

A new generation of large-scale, complex Gravitational Wave detectors is being built. They will produce large amounts of data and will require intensive and specialized interactive/batch data analysis. We present VEGA, a framework for such data analysis based on ROOT. VEGA uses the Frame format adopted as a standard by gravitational wave groups around the world. Furthermore, new tools have been developed to facilitate data access and manipulation, as well as to interface with existing algorithms. VEGA is currently being evaluated by the VIRGO experiment.
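VEGA itself is a C++ framework built on ROOT, so a Python snippet cannot show its actual API; purely to give a flavour of the ROOT-style interactive analysis it targets, the hedged PyROOT sketch below histograms a synthetic strain-like series. The histogram and canvas classes are standard ROOT; the data are invented, and real gravitational wave samples would come from Frame files via the framework's I/O tools, which are not shown here.

```python
# Illustrative PyROOT sketch of ROOT-style interactive analysis; this is not
# VEGA's API, and the synthetic samples below merely stand in for detector
# strain data that would normally be read from Frame files.
import numpy as np
import ROOT

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1e-21, size=10000)   # fake "strain" samples

# Fill a ROOT histogram and draw it, as in an interactive session
hist = ROOT.TH1F("strain", "Synthetic strain distribution;strain;entries",
                 100, -5e-21, 5e-21)
for s in samples:
    hist.Fill(float(s))

canvas = ROOT.TCanvas("c1", "strain")
hist.Draw()
canvas.SaveAs("strain_hist.png")
```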


2016 ◽  
Vol 54 ◽  
pp. 456-468 ◽  
Author(s):  
Changjun Hu ◽  
Yang Li ◽  
Xin Cheng ◽  
Zhenyu Liu

2007 ◽  
Vol 15 (4) ◽  
pp. 249-268 ◽  
Author(s):  
Gurmeet Singh ◽  
Karan Vahi ◽  
Arun Ramakrishnan ◽  
Gaurang Mehta ◽  
Ewa Deelman ◽  
...  

In this paper we examine the problem of optimizing disk usage and scheduling large-scale scientific workflows onto distributed resources, where the workflows are data-intensive, requiring large amounts of data storage, and the resources have limited storage capacity. Our approach is two-fold: we minimize the amount of space a workflow requires during execution by removing data files at runtime when they are no longer needed, and we demonstrate that workflows may have to be restructured to reduce their overall data footprint. We show the results of our data management and workflow restructuring solutions using a Laser Interferometer Gravitational-Wave Observatory (LIGO) application and an astronomy application, Montage, running on a large-scale production grid, the Open Science Grid. We show that although a 48% reduction in the data footprint of Montage can be achieved with dynamic data cleanup techniques alone, LIGO Scientific Collaboration workflows require additional restructuring to achieve a 56% reduction in data space usage. We also examine the cost of the workflow restructuring in terms of the application's runtime.
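The dynamic cleanup idea can be sketched independently of any particular workflow system: given each task's input and output files, a file becomes removable as soon as its last consumer has finished. The toy Python scheduler below (with invented task and file names) tracks the remaining consumers of each file and reports when it can be deleted; it illustrates the footprint-reduction principle only and is not the authors' actual algorithm or the workflow system's implementation.

```python
# Toy illustration of dynamic data cleanup in a workflow: delete a file once
# its last consuming task has finished. Task and file names are invented.
from collections import defaultdict

# Each task lists the files it reads and the files it writes
tasks = {
    "extract":   {"inputs": [],          "outputs": ["raw.dat"]},
    "calibrate": {"inputs": ["raw.dat"], "outputs": ["cal.dat"]},
    "analyze":   {"inputs": ["cal.dat"], "outputs": ["result.dat"]},
    "plot":      {"inputs": ["cal.dat"], "outputs": ["plot.png"]},
}

# Count how many tasks still need each file
consumers = defaultdict(int)
for task in tasks.values():
    for f in task["inputs"]:
        consumers[f] += 1

def run_in_order(order):
    """Simulate executing tasks in a valid topological order, removing files
    as soon as no remaining task needs them."""
    live, peak = set(), 0
    for name in order:
        task = tasks[name]
        live.update(task["outputs"])
        peak = max(peak, len(live))
        for f in task["inputs"]:
            consumers[f] -= 1
            if consumers[f] == 0 and f in live:
                live.remove(f)   # a cleanup job would delete f here
                print(f"after {name}: {f} no longer needed, removed")
    print(f"peak number of live files: {peak}")

run_in_order(["extract", "calibrate", "analyze", "plot"])
```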


2019 ◽  
Vol 6 (1) ◽  
pp. 47-55 ◽  
Author(s):  
Alexandra Paxton ◽  
Alexa Tullett

Today, researchers can collect, analyze, and share more data than ever before. Not only does increasing technological capacity open the door to new data-intensive perspectives in cognitive science and psychology (i.e., research that takes advantage of complex or large-scale data to understand human cognition and behavior), but increasing connectedness has sparked exponential increases in the ease and practice of scientific transparency. The growing open science movement encourages researchers to share data, materials, methods, and publications with other scientists and the wider public. Open science benefits data-intensive psychological science, the public, and public policy, and we present recommendations to improve the adoption of open science practices by changing the academic incentive structure and by improving the education pipeline. Despite ongoing questions about implementing open science guidelines, policy makers have an unprecedented opportunity to shape the next frontier of scientific discovery.


2012 ◽  
pp. 299-306 ◽  
Author(s):  
J.A. Dieleman ◽  
J.J. Magan ◽  
A.M. Wubs ◽  
A. Palloix ◽  
S. Lenk ◽  
...  

Author(s):  
JÖRGEN BRANDT ◽  
WOLFGANG REISIG ◽  
ULF LESER

Cuneiform is a minimal functional programming language for large-scale scientific data analysis. Implementing a strict black-box view on external operators and data, it allows the direct embedding of code in a variety of external languages like Python or R, provides data-parallel higher-order operators for processing large partitioned data sets, allows conditionals and general recursion, and has a naturally parallelizable evaluation strategy suitable for multi-core servers and distributed execution environments like Hadoop, HTCondor, or distributed Erlang. Cuneiform has been applied in several data-intensive research areas, including remote sensing, machine learning, and bioinformatics, all of which critically depend on the flexible assembly of pre-existing tools and libraries written in different languages into complex pipelines. This paper introduces the computation semantics of Cuneiform. It presents Cuneiform's abstract syntax, a simple type system, and the semantics of evaluation. Providing an unambiguous specification of Cuneiform's behavior eases the implementation of interpreters, which we showcase by providing a concise reference implementation in Erlang. The similarity of Cuneiform's syntax to the simply typed lambda calculus puts Cuneiform in perspective and allows a straightforward discussion of its design in the context of functional programming. Moreover, the simple type system allows the deduction of the language's safety up to black-box operators. Last, the formulation of the semantics also permits the verification of compilers to and from other workflow languages.
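Cuneiform's own syntax and semantics are given in the paper; purely to illustrate two of the ideas the abstract emphasises, treating external code as a black box and mapping it in parallel over a partitioned data set, the Python sketch below wraps an opaque worker function and applies it concurrently to partitions. It mimics the flavour of a data-parallel higher-order operator and is not Cuneiform syntax or its Erlang reference implementation.

```python
# Illustration of a Cuneiform-style data-parallel higher-order operator: an
# opaque (black-box) task is mapped over partitions of a data set in parallel.
from concurrent.futures import ProcessPoolExecutor

def black_box_task(partition):
    """Stands in for externally defined code (e.g. a Python or R snippet
    embedded in a task); the runtime only sees its inputs and outputs."""
    return sum(x * x for x in partition)

def parallel_map(task, partitions, workers=4):
    """Data-parallel map of a task over a partitioned data set."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, partitions))

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk = 100_000
    partitions = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    partial_sums = parallel_map(black_box_task, partitions)
    print("sum of squares:", sum(partial_sums))
```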


2018 ◽  
Author(s):  
Alexandra Paxton ◽  
Alexa Mary Tullett

Today, researchers can collect, analyze, and share more data than ever before. Not only does increasing technological capacity open the door to new data-intensive perspectives in cognitive science and psychology (that is, research that takes advantage of complex or large-scale data to understand human cognition and behavior), but increasing connectedness has sparked exponential increases in the ease and practice of scientific transparency. The growing open science movement encourages researchers to share data, materials, methods, and publications with other scientists and the wider public. Open science benefits data-intensive psychological science, the public, and public policy, and we present recommendations to improve the adoption of open science practices by changing the academic incentive structure and by improving the education pipeline. Despite ongoing questions about implementing open-science guidelines, policymakers have an unprecedented opportunity to shape the next frontier of scientific discovery.


Author(s):  
Samik Banerjee ◽  
Lucas Magee ◽  
Dingkang Wang ◽  
Xu Li ◽  
Bingxing Huo ◽  
...  

Understanding neuronal circuitry at cellular resolution within the brain has relied on tract tracing methods, which involve careful observation and interpretation by experienced neuroscientists. With recent developments in imaging and digitization, this approach is no longer feasible for large-scale images (in the terabyte to petabyte range). Machine learning techniques based on deep networks provide an efficient alternative. However, these methods rely on very large volumes of annotated images for training and have error rates that are too high for scientific data analysis, and thus require a significant amount of human-in-the-loop proofreading. Here we introduce a hybrid architecture that combines prior structure, in the form of topological data analysis methods based on discrete Morse theory, with best-in-class deep-net architectures for neuronal connectivity analysis. We show significant performance gains using our hybrid architecture on the detection of topological structure (e.g., connectivity of neuronal processes and local intensity maxima on axons corresponding to synaptic swellings), with precision/recall close to 90% compared with human observers. We have adapted our architecture into a high-performance pipeline capable of semantic segmentation of light microscopic whole-brain image data into a hierarchy of neuronal compartments. We expect that the hybrid architecture incorporating discrete Morse techniques into deep nets will generalize to other data domains.
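The discrete-Morse machinery and the deep networks themselves are beyond a short snippet, but one concrete sub-task named in the abstract, detecting local intensity maxima along axons as candidate synaptic swellings, can be sketched with standard tools. The example below uses SciPy's maximum filter on a synthetic 2-D intensity image; it is a generic stand-in for illustration, not the authors' method.

```python
# Generic sketch of local-intensity-maxima detection (candidate synaptic
# swellings) on a synthetic 2-D image; the paper's discrete-Morse and deep-net
# components are not reproduced here.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(1)

# Synthetic image: low-amplitude smooth background plus a few bright blobs
yy, xx = np.mgrid[0:256, 0:256]
image = 0.1 * ndimage.gaussian_filter(rng.random((256, 256)), sigma=3)
for cy, cx in [(40, 60), (128, 200), (200, 90)]:
    image += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * 4.0 ** 2))

# A pixel is a candidate maximum if it equals the maximum of its 9x9
# neighbourhood and exceeds an absolute intensity threshold
neighbourhood_max = ndimage.maximum_filter(image, size=9)
maxima = (image == neighbourhood_max) & (image > 0.5)

coords = np.argwhere(maxima)
print(f"detected {len(coords)} candidate swellings at:")
print(coords)
```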

