Bio-Node – Bioinformatics in the Cloud

Mapping Intimacies ◽

10.1101/2020.04.15.043596 ◽

2020 ◽

Author(s):

Yannick Spreen ◽

Maximilian Miller

Keyword(s):

Large Scale ◽

Research Area ◽

Data Sets ◽

Web Interface ◽

Data Types ◽

Function Annotation ◽

Specific Data ◽

And Performance ◽

Faster Development ◽

Optimization And Performance

Motivation: The applicability and reproducibility of bioinformatics methods and results often depend on the structure and software architecture of their development. Exponentially growing data sets require ever more optimization and performance with conventional computing capacities lacking this process. This creates a large overhead for software development in a research area which is primarily interested in solving complex biological problems rather than developing new, performant software solutions. In pure computer science, new structures in the field of web development have produced more efficient processes for container-based software solutions. The advantages of these structures have rarely been explored in a broader scientific scale. This is also the case with the trend of migrating computations from on premise resources to the cloud. Results: We created Bio-Node, a new platform for large scale bio data analysis utilizing cloud compute resources (publicly available at https://bio-node.de). Bio-Node enables building complex workflows using a sophisticated web interface. We applied Bio-Node to implement bioinformatic workflows for rapid metagenome function annotation. We further developed "Auto-Clustering", a workflow that automatically extracts the most suited clustering parameters for specific data types and subsequently enables to optimally segregate unknown samples of the same type. Compared to existing methods and approaches Bio-Node improves performance and costs of bioinformatics data analyses while providing an easier and faster development process with focus on reproducibility and reusability.

Download Full-text

bíogo: a simple high-performance bioinformatics toolkit for the Go language

10.1101/005033 ◽

2014 ◽

Cited By ~ 6

Author(s):

R Daniel Kortschak ◽

David L Adelson

Keyword(s):

High Performance ◽

Large Scale ◽

Large Data ◽

Biological Data ◽

Data Sets ◽

Barriers To Entry ◽

Data Types ◽

Concurrent Processing ◽

Computationally Intensive ◽

And Performance

bíogo is a framework designed to ease development and maintenance of computationally intensive bioinformatics applications. The library is written in the Go programming language, a garbage-collected, strictly typed compiled language with built in support for concurrent processing, and performance comparable to C and Java. It provides a variety of data types and utility functions to facilitate manipulation and analysis of large scale genomic and other biological data. bíogo uses a concise and expressive syntax, lowering the barriers to entry for researchers needing to process large data sets with custom analyses while retaining computational safety and ease of code review. We believe bíogo provides an excellent environment for training and research in computational biology because of its combination of strict typing, simple and expressive syntax, and high performance.

Download Full-text

Electrolyte flow optimization and performance metrics analysis of vanadium redox flow battery for large-scale stationary energy storage

International Journal of Hydrogen Energy ◽

10.1016/j.ijhydene.2021.06.220 ◽

2021 ◽

Author(s):

Zebo Huang ◽

Anle Mu ◽

Longxing Wu ◽

Hang Wang ◽

Yongjun Zhang

Keyword(s):

Energy Storage ◽

Large Scale ◽

Performance Metrics ◽

Vanadium Redox Flow Battery ◽

Redox Flow Battery ◽

Electrolyte Flow ◽

Stationary Energy ◽

And Performance ◽

Vanadium Redox ◽

Optimization And Performance

Download Full-text

A Real-Time Log Analyzer Based on MongoDB

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.571-572.497 ◽

2014 ◽

Vol 571-572 ◽

pp. 497-501 ◽

Cited By ~ 3

Author(s):

Qi Lv ◽

Wei Xie

Keyword(s):

Real Time ◽

Large Scale ◽

Performance Comparison ◽

Log Analysis ◽

Data Sets ◽

Time Data ◽

Real Time Analysis ◽

Large Scale Data ◽

Implementation Approach ◽

And Performance

Real-time log analysis on large scale data is important for applications. Specifically, real-time refers to UI latency within 100ms. Therefore, techniques which efficiently support real-time analysis over large log data sets are desired. MongoDB provides well query performance, aggregation frameworks, and distributed architecture which is suitable for real-time data query and massive log analysis. In this paper, a novel implementation approach for an event driven file log analyzer is presented, and performance comparison of query, scan and aggregation operations over MongoDB, HBase and MySQL is analyzed. Our experimental results show that HBase performs best balanced in all operations, while MongoDB provides less than 10ms query speed in some operations which is most suitable for real-time applications.

Download Full-text

Research Challenges in Big Data Analytics

Decision Management ◽

10.4018/978-1-5225-1837-2.ch006 ◽

2017 ◽

pp. 83-99

Author(s):

Sivamathi Chokkalingam ◽

Vijayarani S.

Keyword(s):

Big Data ◽

Data Analytics ◽

Large Scale ◽

New Technologies ◽

Big Data Analytics ◽

Large Data ◽

Data Sets ◽

Data Types ◽

Customer Preferences ◽

Research Challenges

The term Big Data refers to large-scale information management and analysis technologies that exceed the capability of traditional data processing technologies. Big Data is differentiated from traditional technologies in three ways: volume, velocity and variety of data. Big data analytics is the process of analyzing large data sets which contains a variety of data types to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Since Big Data is new emerging field, there is a need for development of new technologies and algorithms for handling big data. The main objective of this paper is to provide knowledge about various research challenges of Big Data analytics. A brief overview of various types of Big Data analytics is discussed in this paper. For each analytics, the paper describes process steps and tools. A banking application is given for each analytics. Some of research challenges and possible solutions for those challenges of big data analytics are also discussed.

Download Full-text

A Comparative Study on a Built Sun Tracker and Fixed Converter Panels

International Journal of Energy Optimization and Engineering ◽

10.4018/ijeoe.2012100104 ◽

2012 ◽

Vol 1 (4) ◽

pp. 56-69

Author(s):

Farzin Shama ◽

Gholam Hossein Roshani ◽

Sobhan Roshani ◽

Arash Ahmadi ◽

Saber Karami

Keyword(s):

Renewable Energy ◽

Power Plants ◽

Large Scale ◽

Human Society ◽

Renewable Energy Resources ◽

Advantages And Disadvantages ◽

Production And Consumption ◽

Solar Power Plants ◽

And Performance ◽

Optimization And Performance

Producing non-polluting renewable energy in large scale is essential for sustainability of future developments in industry and human society. Among renewable energy resources, solar energy takes a special place because of its free accessibility and affordability. However, the optimization of its production and consumption processes poses important concerns, essentially in the affordability issue. This paper investigates several optimization and performance issues regarding solar panel converters using two-axis controlled solar tracer that has been practically implemented in comparison with fixed converter panels. Results shown in tables and graphs demonstrate clearly the advantages and disadvantages of the methods. Based on these results, large scale solar power plants are being suggested to be equipped with similar devices.

Download Full-text

A reduced order approach for probabilistic inversions of 3-D magnetotelluric data I: general formulation

Geophysical Journal International ◽

10.1093/gji/ggaa415 ◽

2020 ◽

Vol 223 (3) ◽

pp. 1837-1863

Author(s):

M C Manassero ◽

J C Afonso ◽

F Zyserman ◽

S Zlotnik ◽

I Fomin

Keyword(s):

Large Scale ◽

Computational Cost ◽

Forward Problem ◽

Parallel Structure ◽

Data Sets ◽

Magnetotelluric Data ◽

Reduced Order ◽

Mcmc Algorithms ◽

Simulation Based ◽

And Performance

SUMMARY Simulation-based probabilistic inversions of 3-D magnetotelluric (MT) data are arguably the best option to deal with the nonlinearity and non-uniqueness of the MT problem. However, the computational cost associated with the modelling of 3-D MT data has so far precluded the community from adopting and/or pursuing full probabilistic inversions of large MT data sets. In this contribution, we present a novel and general inversion framework, driven by Markov Chain Monte Carlo (MCMC) algorithms, which combines (i) an efficient parallel-in-parallel structure to solve the 3-D forward problem, (ii) a reduced order technique to create fast and accurate surrogate models of the forward problem and (iii) adaptive strategies for both the MCMC algorithm and the surrogate model. In particular, and contrary to traditional implementations, the adaptation of the surrogate is integrated into the MCMC inversion. This circumvents the need of costly offline stages to build the surrogate and further increases the overall efficiency of the method. We demonstrate the feasibility and performance of our approach to invert for large-scale conductivity structures with two numerical examples using different parametrizations and dimensionalities. In both cases, we report staggering gains in computational efficiency compared to traditional MCMC implementations. Our method finally removes the main bottleneck of probabilistic inversions of 3-D MT data and opens up new opportunities for both stand-alone MT inversions and multi-observable joint inversions for the physical state of the Earth’s interior.

Download Full-text

AstroCatR: a mechanism and tool for efficient time series reconstruction of large-scale astronomical catalogues

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa1413 ◽

2020 ◽

Vol 496 (1) ◽

pp. 629-637

Author(s):

Ce Yu ◽

Kun Li ◽

Shanjiang Tang ◽

Chao Sun ◽

Bin Ma ◽

...

Keyword(s):

Time Series ◽

High Performance ◽

Large Scale ◽

Extrasolar Planets ◽

Time Series Data ◽

Series Data ◽

Data Sets ◽

Observation Data ◽

Data Volume ◽

And Performance

ABSTRACT Time series data of celestial objects are commonly used to study valuable and unexpected objects such as extrasolar planets and supernova in time domain astronomy. Due to the rapid growth of data volume, traditional manual methods are becoming extremely hard and infeasible for continuously analysing accumulated observation data. To meet such demands, we designed and implemented a special tool named AstroCatR that can efficiently and flexibly reconstruct time series data from large-scale astronomical catalogues. AstroCatR can load original catalogue data from Flexible Image Transport System (FITS) files or data bases, match each item to determine which object it belongs to, and finally produce time series data sets. To support the high-performance parallel processing of large-scale data sets, AstroCatR uses the extract-transform-load (ETL) pre-processing module to create sky zone files and balance the workload. The matching module uses the overlapped indexing method and an in-memory reference table to improve accuracy and performance. The output of AstroCatR can be stored in CSV files or be transformed other into formats as needed. Simultaneously, the module-based software architecture ensures the flexibility and scalability of AstroCatR. We evaluated AstroCatR with actual observation data from The three Antarctic Survey Telescopes (AST3). The experiments demonstrate that AstroCatR can efficiently and flexibly reconstruct all time series data by setting relevant parameters and configuration files. Furthermore, the tool is approximately 3× faster than methods using relational data base management systems at matching massive catalogues.

Download Full-text

CMIP model evaluation with the ESMValTool v2.0

10.5194/egusphere-egu2020-13306 ◽

2020 ◽

Author(s):

Axel Lauer ◽

Fernando Iglesias-Suarez ◽

Veronika Eyring ◽

the ESMValTool development team

Keyword(s):

Model Evaluation ◽

Large Scale ◽

Performance Metrics ◽

Comprehensive Evaluation ◽

Coupled Model ◽

Data Sets ◽

Evaluation Tool ◽

Emergent Constraints ◽

Data Volume ◽

And Performance

<p>The Earth System Model Evaluation Tool (ESMValTool) has been developed with the aim of taking model evaluation to the next level by facilitating analysis of many different ESM components, providing well-documented source code and scientific background of implemented diagnostics and metrics and allowing for traceability and reproducibility of results (provenance). This has been made possible by a lively and growing development community continuously improving the tool supported by multiple national and European projects. The latest version (2.0) of the ESMValTool has been developed as a large community effort to specifically target the increased data volume of the Coupled Model Intercomparison Project Phase 6 (CMIP6) and the related challenges posed by analysis and evaluation of output from multiple high-resolution and complex ESMs. For this, the core functionalities have been completely rewritten in order to take advantage of state-of-the-art computational libraries and methods to allow for efficient and user-friendly data processing. Common operations on the input data such as regridding or computation of multi-model statistics are now centralized in a highly optimized preprocessor written in Python. The diagnostic part of the ESMValTool includes a large collection of standard recipes for reproducing peer-reviewed analyses of many variables across atmosphere, ocean, and land domains, with diagnostics and performance metrics focusing on the mean-state, trends, variability and important processes, phenomena, as well as emergent constraints. While most of the diagnostics use observational data sets (in particular satellite and ground-based observations) or reanalysis products for model evaluation some are also based on model-to-model comparisons. This presentation introduces the diagnostics newly implemented into ESMValTool v2.0 including an extended set of large-scale diagnostics for quasi-operational and comprehensive evaluation of ESMs, new diagnostics for extreme events, regional model and impact evaluation and analysis of ESMs, as well as diagnostics for emergent constraints and analysis of future projections from ESMs. The new diagnostics are illustrated with examples using results from the well-established CMIP5 and the newly available CMIP6 data sets.</p>

Download Full-text

Large-Scale Analysis of Genetic and Clinical Patient Data

Annual Review of Biomedical Data Science ◽

10.1146/annurev-biodatasci-080917-013508 ◽

2018 ◽

Vol 1 (1) ◽

pp. 263-274 ◽

Cited By ~ 6

Author(s):

Marylyn D. Ritchie

Keyword(s):

Clinical Data ◽

Large Scale ◽

Data Science ◽

Genomic Analysis ◽

Genomic Data ◽

Data Sets ◽

Biomedical Data ◽

Data Types ◽

Phenotypic Data ◽

Clinical Patient

Biomedical data science has experienced an explosion of new data over the past decade. Abundant genetic and genomic data are increasingly available in large, diverse data sets due to the maturation of modern molecular technologies. Along with these molecular data, dense, rich phenotypic data are also available on comprehensive clinical data sets from health care provider organizations, clinical trials, population health registries, and epidemiologic studies. The methods and approaches for interrogating these large genetic/genomic and clinical data sets continue to evolve rapidly, as our understanding of the questions and challenges continue to emerge. In this review, the state-of-the-art methodologies for genetic/genomic analysis along with complex phenomics will be discussed. This field is changing and adapting to the novel data types made available, as well as technological advances in computation and machine learning. Thus, I will also discuss the future challenges in this exciting and innovative space. The promises of precision medicine rely heavily on the ability to marry complex genetic/genomic data with clinical phenotypes in meaningful ways.

Download Full-text