Integrated Cloud Computing Environment for Upstream Geoscience Workflows

2021 ◽  
Author(s):  
Murtadha Al-Habib ◽  
Yasser Al-Ghamdi

Abstract Extensive computing resources are required to leverage today's advanced geoscience workflows that are used to explore and characterize giant petroleum resources. In these cases, high-performance workstations are often unable to adequately handle the scale of computing required. The workflows typically utilize complex and massive data sets, which require advanced computing resources to store, process, manage, and visualize various forms of the data throughout their lifecycles. This work describes a large-scale geoscience end-to-end interpretation platform customized to run on a cluster-based remote visualization environment. A team of computing infrastructure and geoscience workflow experts was established to collaborate on the deployment, which was broken down into separate phases. Initially, an evaluation and analysis phase was conducted to analyze computing requirements and assess potential solutions. A testing environment was then designed, implemented and benchmarked. The third phase used the test environment to determine the scale of infrastructure required for the production environment. Finally, the full-scale customized production environment was deployed for end users. During the testing phase, aspects such as connectivity, stability, interactivity, functionality, and performance were investigated using the largest available geoscience datasets. Multiple computing configurations were benchmarked until optimal performance was achieved, within applicable corporate information security guidelines. The customized production environment was able to execute workflows that could not run on local user workstations. For example, during connectivity, stability and interactivity benchmarking, the test environment was operated for extended periods to ensure stability for workflows that require multiple days to run. To estimate the scale of the required production environment, user portfolios were categorized by data type, scale and workflow. Continuous monitoring of system resources and utilization enabled continuous improvements to the final solution. A fit-for-purpose, customized remote visualization solution may reduce or ultimately eliminate the need to deploy high-end workstations to all end users. Rather, a shared, scalable and reliable cluster-based solution can serve a much larger user community in a highly performant manner.
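The abstract notes that continuous monitoring of system resources guided improvements to the deployment. As a rough illustration only (not the paper's tooling), the sketch below samples per-node utilization periodically with Python and psutil; the interval, sample count and 90% memory threshold are assumed values:

```python
import time
import psutil  # third-party package: pip install psutil

def sample_utilization(interval_s=5, samples=12):
    """Collect periodic CPU/memory/swap utilization snapshots for one node.

    Illustrative sketch only; the sampling interval and the 90% memory
    warning threshold are assumptions, not values from the paper.
    """
    history = []
    for _ in range(samples):
        snapshot = {
            "cpu_percent": psutil.cpu_percent(interval=None),
            "mem_percent": psutil.virtual_memory().percent,
            "swap_percent": psutil.swap_memory().percent,
        }
        if snapshot["mem_percent"] > 90:
            print("warning: node memory above 90%")
        history.append(snapshot)
        time.sleep(interval_s)
    return history

if __name__ == "__main__":
    for s in sample_utilization(interval_s=1, samples=3):
        print(s)
```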

2020 ◽  
Vol 496 (1) ◽  
pp. 629-637
Author(s):  
Ce Yu ◽  
Kun Li ◽  
Shanjiang Tang ◽  
Chao Sun ◽  
Bin Ma ◽  
...  

ABSTRACT Time series data of celestial objects are commonly used to study valuable and unexpected objects such as extrasolar planets and supernovae in time domain astronomy. Due to the rapid growth of data volume, traditional manual methods are becoming infeasible for continuously analysing the accumulated observation data. To meet such demands, we designed and implemented a special tool named AstroCatR that can efficiently and flexibly reconstruct time series data from large-scale astronomical catalogues. AstroCatR can load original catalogue data from Flexible Image Transport System (FITS) files or databases, match each item to determine which object it belongs to, and finally produce time series data sets. To support the high-performance parallel processing of large-scale data sets, AstroCatR uses the extract-transform-load (ETL) pre-processing module to create sky zone files and balance the workload. The matching module uses the overlapped indexing method and an in-memory reference table to improve accuracy and performance. The output of AstroCatR can be stored in CSV files or transformed into other formats as needed. Simultaneously, the module-based software architecture ensures the flexibility and scalability of AstroCatR. We evaluated AstroCatR with actual observation data from the three Antarctic Survey Telescopes (AST3). The experiments demonstrate that AstroCatR can efficiently and flexibly reconstruct all time series data by setting relevant parameters and configuration files. Furthermore, the tool is approximately 3× faster than methods using relational database management systems at matching massive catalogues.
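As a hedged sketch of the positional cross-matching idea described above (sky zone files plus an in-memory reference table), the following Python snippet buckets reference objects into declination zones and matches each detection against its own and neighbouring zones; the zone height and 1-arcsecond match radius are illustrative assumptions, not AstroCatR's actual parameters:

```python
import math
from collections import defaultdict

ZONE_HEIGHT_DEG = 0.5          # assumed zone height
MATCH_RADIUS_DEG = 1.0 / 3600  # assumed 1-arcsecond match radius

def zone_of(dec_deg):
    return int(math.floor((dec_deg + 90.0) / ZONE_HEIGHT_DEG))

def build_reference_table(reference_objects):
    """Group reference objects (id, ra, dec) by zone for in-memory lookup."""
    table = defaultdict(list)
    for obj_id, ra, dec in reference_objects:
        table[zone_of(dec)].append((obj_id, ra, dec))
    return table

def match(detection, table):
    """Return the id of the closest reference object within the radius, or None."""
    ra, dec = detection
    best_id, best_d2 = None, MATCH_RADIUS_DEG ** 2
    # Search the detection's zone and both neighbouring (overlapping) zones.
    for z in (zone_of(dec) - 1, zone_of(dec), zone_of(dec) + 1):
        for obj_id, ora, odec in table.get(z, []):
            d2 = ((ra - ora) * math.cos(math.radians(dec))) ** 2 + (dec - odec) ** 2
            if d2 <= best_d2:
                best_id, best_d2 = obj_id, d2
    return best_id

refs = [("obj-1", 150.001, -30.002), ("obj-2", 150.500, -30.400)]
table = build_reference_table(refs)
print(match((150.0012, -30.0021), table))  # -> obj-1
```

Grouping detections that match the same reference object then yields one time series per object.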


2014 ◽  
Author(s):  
R Daniel Kortschak ◽  
David L Adelson

bíogo is a framework designed to ease development and maintenance of computationally intensive bioinformatics applications. The library is written in the Go programming language, a garbage-collected, strictly typed compiled language with built-in support for concurrent processing and performance comparable to C and Java. It provides a variety of data types and utility functions to facilitate manipulation and analysis of large-scale genomic and other biological data. bíogo uses a concise and expressive syntax, lowering the barriers to entry for researchers needing to process large data sets with custom analyses while retaining computational safety and ease of code review. We believe bíogo provides an excellent environment for training and research in computational biology because of its combination of strict typing, simple and expressive syntax, and high performance.


2013 ◽  
Vol 12 (6) ◽  
pp. 2858-2868 ◽  
Author(s):  
Nadin Neuhauser ◽  
Nagarjuna Nagaraj ◽  
Peter McHardy ◽  
Sara Zanivan ◽  
Richard Scheltema ◽  
...  

2014 ◽  
Vol 571-572 ◽  
pp. 497-501 ◽  
Author(s):  
Qi Lv ◽  
Wei Xie

Real-time log analysis over large-scale data is important for many applications; here, real-time means a UI latency within 100 ms. Techniques that efficiently support real-time analysis over large log data sets are therefore desired. MongoDB provides good query performance, an aggregation framework, and a distributed architecture suitable for real-time data query and massive log analysis. In this paper, a novel implementation approach for an event-driven file log analyzer is presented, and the performance of query, scan and aggregation operations is compared across MongoDB, HBase and MySQL. Our experimental results show that HBase delivers the most balanced performance across all operations, while MongoDB provides sub-10 ms query latency for some operations, making it the most suitable for real-time applications.
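As an illustration of the kind of query and aggregation operations compared in the paper, a brief pymongo sketch follows; the collection layout and field names ("source", "level", "latency_ms") are assumptions made here, not the paper's schema:

```python
from pymongo import MongoClient, ASCENDING

# Hypothetical log document shape: {"ts": ..., "source": ..., "level": ..., "latency_ms": ...}
client = MongoClient("mongodb://localhost:27017/")
logs = client["logdb"]["events"]

# Index the queried fields so point queries stay in the low-millisecond range.
logs.create_index([("source", ASCENDING), ("ts", ASCENDING)])

# Point query: the 20 most recent ERROR events from one source.
recent_errors = list(
    logs.find({"source": "web-01", "level": "ERROR"}).sort("ts", -1).limit(20)
)

# Aggregation: per-source event counts and mean latency for WARN/ERROR events.
pipeline = [
    {"$match": {"level": {"$in": ["WARN", "ERROR"]}}},
    {"$group": {"_id": "$source",
                "count": {"$sum": 1},
                "avg_latency_ms": {"$avg": "$latency_ms"}}},
    {"$sort": {"count": -1}},
]
for row in logs.aggregate(pipeline):
    print(row)
```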


2020 ◽  
Vol 223 (3) ◽  
pp. 1837-1863
Author(s):  
M C Manassero ◽  
J C Afonso ◽  
F Zyserman ◽  
S Zlotnik ◽  
I Fomin

SUMMARY Simulation-based probabilistic inversions of 3-D magnetotelluric (MT) data are arguably the best option to deal with the nonlinearity and non-uniqueness of the MT problem. However, the computational cost associated with the modelling of 3-D MT data has so far precluded the community from adopting and/or pursuing full probabilistic inversions of large MT data sets. In this contribution, we present a novel and general inversion framework, driven by Markov Chain Monte Carlo (MCMC) algorithms, which combines (i) an efficient parallel-in-parallel structure to solve the 3-D forward problem, (ii) a reduced-order technique to create fast and accurate surrogate models of the forward problem and (iii) adaptive strategies for both the MCMC algorithm and the surrogate model. In particular, and contrary to traditional implementations, the adaptation of the surrogate is integrated into the MCMC inversion. This circumvents the need for costly offline stages to build the surrogate and further increases the overall efficiency of the method. We demonstrate the feasibility and performance of our approach to invert for large-scale conductivity structures with two numerical examples using different parametrizations and dimensionalities. In both cases, we report staggering gains in computational efficiency compared to traditional MCMC implementations. Our method finally removes the main bottleneck of probabilistic inversions of 3-D MT data and opens up new opportunities for both stand-alone MT inversions and multi-observable joint inversions for the physical state of the Earth’s interior.
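The key idea of adapting the surrogate inside the MCMC loop can be illustrated with a deliberately simplified 1-D toy problem; the polynomial surrogate, refresh interval and noise level below are assumptions for illustration, and the paper's reduced-order surrogates for 3-D MT forward modelling are far more sophisticated:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_expensive(m):
    """Stand-in for a costly forward solve (toy 1-D function here)."""
    return np.sin(3.0 * m) + 0.1 * m ** 2

class PolySurrogate:
    """Tiny polynomial surrogate, refitted as new exact evaluations arrive."""
    def __init__(self):
        self.xs, self.ys, self.coeffs = [], [], None
    def add(self, x, y):
        self.xs.append(x)
        self.ys.append(y)
        if len(self.xs) >= 4:
            self.coeffs = np.polyfit(self.xs, self.ys, 3)
    def ready(self):
        return self.coeffs is not None
    def __call__(self, x):
        return np.polyval(self.coeffs, x)

def log_like(pred, obs, sigma=0.1):
    return -0.5 * ((pred - obs) / sigma) ** 2

obs = forward_expensive(0.7) + 0.05   # synthetic datum
surrogate = PolySurrogate()
m, ll = 0.0, log_like(forward_expensive(0.0), obs)
surrogate.add(0.0, forward_expensive(0.0))
chain = []
for it in range(2000):
    m_prop = m + 0.2 * rng.standard_normal()
    # Use the surrogate when available; run (and learn from) the exact
    # solve every 50th iteration so the surrogate keeps improving.
    if surrogate.ready() and it % 50 != 0:
        pred = surrogate(m_prop)
    else:
        pred = forward_expensive(m_prop)
        surrogate.add(m_prop, pred)
    ll_prop = log_like(pred, obs)
    if np.log(rng.uniform()) < ll_prop - ll:   # Metropolis acceptance, flat prior
        m, ll = m_prop, ll_prop
    chain.append(m)
print("posterior mean:", np.mean(chain[500:]))
```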


2020 ◽  
Author(s):  
Axel Lauer ◽  
Fernando Iglesias-Suarez ◽  
Veronika Eyring ◽  
the ESMValTool development team

The Earth System Model Evaluation Tool (ESMValTool) has been developed with the aim of taking model evaluation to the next level by facilitating the analysis of many different ESM components, providing well-documented source code and the scientific background of implemented diagnostics and metrics, and allowing for traceability and reproducibility of results (provenance). This has been made possible by a lively and growing development community continuously improving the tool, supported by multiple national and European projects. The latest version (2.0) of the ESMValTool has been developed as a large community effort to specifically target the increased data volume of the Coupled Model Intercomparison Project Phase 6 (CMIP6) and the related challenges posed by the analysis and evaluation of output from multiple high-resolution and complex ESMs. For this, the core functionalities have been completely rewritten in order to take advantage of state-of-the-art computational libraries and methods and to allow for efficient and user-friendly data processing. Common operations on the input data such as regridding or computation of multi-model statistics are now centralized in a highly optimized preprocessor written in Python. The diagnostic part of the ESMValTool includes a large collection of standard recipes for reproducing peer-reviewed analyses of many variables across atmosphere, ocean, and land domains, with diagnostics and performance metrics focusing on the mean state, trends, variability, important processes and phenomena, as well as emergent constraints. While most of the diagnostics use observational data sets (in particular satellite and ground-based observations) or reanalysis products for model evaluation, some are also based on model-to-model comparisons. This presentation introduces the diagnostics newly implemented into ESMValTool v2.0, including an extended set of large-scale diagnostics for quasi-operational and comprehensive evaluation of ESMs, new diagnostics for extreme events, regional model and impact evaluation and analysis of ESMs, as well as diagnostics for emergent constraints and analysis of future projections from ESMs. The new diagnostics are illustrated with examples using results from the well-established CMIP5 and the newly available CMIP6 data sets.
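For readers unfamiliar with the kind of preprocessing the abstract refers to, the sketch below computes a multi-model mean and standard deviation with xarray on synthetic data; in ESMValTool v2.0 such operations are handled by its optimized Python preprocessor rather than by ad hoc scripts like this:

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for output from three climate models (monthly near-surface
# air temperature on a tiny 4x8 grid); real CMIP data would be read from files.
models = []
for seed in range(3):
    rng = np.random.default_rng(seed)
    tas = xr.DataArray(
        288.0 + rng.normal(0.0, 1.0, size=(12, 4, 8)),
        dims=("time", "lat", "lon"),
        coords={"time": np.arange(12)},
        name="tas",
    )
    models.append(tas)

# Multi-model statistics: stack along a new "model" dimension, then reduce.
ensemble = xr.concat(models, dim="model")
mm_mean = ensemble.mean(dim="model")
mm_std = ensemble.std(dim="model")
print(mm_mean.shape, float(mm_mean.mean()), float(mm_std.mean()))
```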


2017 ◽  
Vol 20 (4) ◽  
pp. 1151-1159 ◽  
Author(s):  
Folker Meyer ◽  
Saurabh Bagchi ◽  
Somali Chaterji ◽  
Wolfgang Gerlach ◽  
Ananth Grama ◽  
...  

Abstract As technologies change, MG-RAST is adapting. Newly available software is being included to improve accuracy and performance. As a computational service constantly running large-volume scientific workflows, MG-RAST is the right location to perform benchmarking and implement algorithmic or platform improvements, in many cases involving trade-offs between specificity, sensitivity and run-time cost. The work in [Glass EM, Dribinsky Y, Yilmaz P, et al. ISME J 2014;8:1–3] is an example; we use existing well-studied data sets as gold standards representing different environments and different technologies to evaluate any changes to the pipeline. Currently, we use well-understood data sets in MG-RAST as a platform for benchmarking. The use of artificial data sets for pipeline performance optimization has not added value, as these data sets do not present the same challenges as real-world data sets. In addition, the MG-RAST team welcomes suggestions for improvements of the workflow. We are currently working on versions 4.02 and 4.1, both of which contain significant input from the community and our partners; these releases will enable double barcoding, support stronger inferences from longer-read technologies, and increase throughput while maintaining sensitivity by using Diamond and SortMeRNA. On the technical platform side, the MG-RAST team intends to support the Common Workflow Language as a standard to specify bioinformatics workflows, both to facilitate development and efficient high-performance implementation of the community’s data analysis tasks.
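As a minimal sketch of the gold-standard comparison behind the specificity/sensitivity trade-offs mentioned above (the read and annotation identifiers are made up, and MG-RAST's actual benchmarking is considerably more involved):

```python
def benchmark(predicted, gold_standard):
    """Compare predicted (read_id, annotation) pairs against a gold standard."""
    tp = len(predicted & gold_standard)
    fp = len(predicted - gold_standard)
    fn = len(gold_standard - predicted)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"sensitivity": sensitivity, "precision": precision,
            "tp": tp, "fp": fp, "fn": fn}

gold = {("read1", "K00001"), ("read2", "K00002"), ("read3", "K00003")}
pred = {("read1", "K00001"), ("read2", "K00009"), ("read3", "K00003")}
print(benchmark(pred, gold))  # sensitivity and precision both 2/3
```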


2021 ◽  
Author(s):  
Allen Yen-Cheng Yu

Many large-scale online applications enable thousands of users to access their services simultaneously. However, the overall service quality of an online application usually degrades as the number of users increases because, traditionally, centralized server architectures do not scale well. In order to provide better Quality of Service (QoS), service architectures such as Grid computing can be used. This type of architecture offers service scalability by utilizing heterogeneous hardware resources. In this thesis, a novel design of Grid computing middleware, the Massively Multi-user Online Platform (MMOP), which integrates Peer-to-Peer (P2P) structured overlays, is proposed. The objectives of this proposed design are to offer scalability and system design flexibility, simplify the development of distributed applications, and improve QoS by following specified policy rules. A Massively Multiplayer Online Game (MMOG) has been created to validate the functionality and performance of MMOP. The simulation results demonstrate that MMOP is a high-performance, scalable servicing and computing middleware.
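As background for the structured-overlay integration mentioned above, the sketch below shows consistent hashing, the key-to-node mapping idea such overlays are typically built on; the node names, virtual-node count and session identifiers are illustrative and not taken from the thesis:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring mapping keys (e.g. game sessions) to nodes."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted (hash, node) pairs; vnodes smooths the load
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    def lookup(self, key):
        """Route a key to the first node clockwise from its hash position."""
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
for session in ("session-1", "session-2", "session-3"):
    print(session, "->", ring.lookup(session))
```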


2018 ◽  
Author(s):  
Li Chen ◽  
Bai Zhang ◽  
Michael Schnaubelt ◽  
Punit Shah ◽  
Paul Aiyetan ◽  
...  

ABSTRACT Rapid development and wide adoption of mass spectrometry-based proteomics technologies have empowered scientists to study proteins and their modifications in complex samples on a large scale. This progress has also created unprecedented challenges for individual labs to store, manage and analyze proteomics data, both in the cost of proprietary software and high-performance computing, and in the long processing times that discourage the on-the-fly changes to data processing settings required in exploratory and discovery analysis. We developed an open-source, cloud computing-based pipeline, MS-PyCloud, with graphical user interface (GUI) support, for LC-MS/MS data analysis. The major components of this pipeline include data file integrity validation, MS/MS database search for spectral assignment, false discovery rate estimation, protein inference, determination of protein post-translational modifications, and quantitation of specific (modified) peptides and proteins. To ensure the transparency and reproducibility of data analysis, MS-PyCloud includes open-source software tools with comprehensive testing and versioning for spectrum assignments. Leveraging public cloud computing infrastructure via Amazon Web Services (AWS), MS-PyCloud scales seamlessly based on analysis demand to achieve fast and efficient performance. Application of the pipeline to the analysis of large-scale iTRAQ/TMT LC-MS/MS data sets demonstrated the effectiveness and high performance of MS-PyCloud. The software can be downloaded at: https://bitbucket.org/mschnau/ms-pycloud/downloads/
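One of the pipeline steps listed above, false discovery rate estimation, is commonly done with a target-decoy sweep; the snippet below is a generic sketch of that calculation (not MS-PyCloud's actual implementation), with made-up scores:

```python
def fdr_score_cutoff(psms, fdr_target=0.01):
    """Return the lowest score cut-off whose running decoy/target FDR stays
    within fdr_target. `psms` is a list of (score, is_decoy) matches."""
    decoys = targets = 0
    cutoff = None
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        if fdr <= fdr_target:
            cutoff = score
    return cutoff

psms = [(9.1, False), (8.7, False), (8.2, False), (7.9, True), (7.5, False), (6.0, True)]
print(fdr_score_cutoff(psms, fdr_target=0.34))  # -> 7.5
```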

