Performance and scaling behavior of bioinformatic applications in virtualization environments to create awareness for the efficient use of compute resources

2021 · Vol 17 (7) · pp. e1009244
Author(s): Maximilian Hanussek, Felix Bartusch, Jens Krüger

The large amount of biological data available today makes it necessary to use tools and applications based on sophisticated and efficient algorithms developed in the area of bioinformatics. Furthermore, access to high performance computing resources is necessary to achieve results in reasonable time. To speed up applications and utilize available compute resources as efficiently as possible, software developers make use of parallelization mechanisms like multithreading. Many of the available bioinformatics tools offer multithreading capabilities, but more compute power is not always helpful. In this study we investigated the behavior of well-known bioinformatics applications with our benchmarking tool suite BOOTABLE, regarding their performance in terms of scaling, different virtual environments and different datasets. The tool suite includes the tools BBMap, Bowtie2, BWA, Velvet, IDBA, SPAdes, Clustal Omega, MAFFT, SINA and GROMACS. In addition, we added an application using the machine learning framework TensorFlow. Machine learning is not directly part of bioinformatics but is applied to many biological problems, especially in the context of medical images (X-ray photographs). The mentioned tools have been analyzed in two different virtual environments: a virtual machine environment based on the OpenStack cloud software and a Docker environment. The measured performance values were compared to a bare-metal setup and among each other. The study reveals that the virtual environments used produce an overhead in the range of seven to twenty-five percent compared to the bare-metal environment. The scaling measurements showed that some of the analyzed tools do not benefit from larger amounts of computing resources, whereas others showed an almost linear scaling behavior. The findings of this study have been generalized as far as possible and should help users to find the best amount of resources for their analyses. Furthermore, the results provide valuable information for resource providers to handle their resources as efficiently as possible and raise the user community’s awareness of the efficient usage of computing resources.
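
The scaling measurements described here boil down to comparing wall-clock runtimes across thread counts. As a minimal sketch (not the actual BOOTABLE suite; the Bowtie2 index `ref_index` and read file `reads.fastq` are hypothetical placeholders), speedup and parallel efficiency can be derived like this:

```python
import subprocess
import time

# Minimal sketch of a scaling measurement (not the actual BOOTABLE suite).
# Bowtie2's "-p" flag sets the thread count; "ref_index" and "reads.fastq"
# are hypothetical placeholder inputs.
THREAD_COUNTS = [1, 2, 4, 8, 16]

def run_once(threads: int) -> float:
    start = time.perf_counter()
    subprocess.run(
        ["bowtie2", "-p", str(threads), "-x", "ref_index",
         "-U", "reads.fastq", "-S", "/dev/null"],
        check=True,
    )
    return time.perf_counter() - start

runtimes = {t: run_once(t) for t in THREAD_COUNTS}
baseline = runtimes[1]
for t, elapsed in runtimes.items():
    speedup = baseline / elapsed   # ideal: equal to t
    efficiency = speedup / t       # ideal: 1.0
    print(f"{t:2d} threads: {elapsed:8.1f} s  "
          f"speedup {speedup:5.2f}x  efficiency {efficiency:4.2f}")
```

Tools that "do not benefit from larger amounts of computing resources" show efficiency falling well below 1.0 as the thread count grows; near-linearly scaling tools keep it close to 1.0.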

2020
Author(s): Alessandro Petrini, Marco Mesiti, Max Schubach, Marco Frasca, Daniel Danis, ...

Several prediction problems in Computational Biology and Genomic Medicine are characterized by both big data and a high imbalance between the examples to be learned, whereby positive examples can represent a tiny minority with respect to negative examples. For instance, deleterious or pathogenic variants are overwhelmed by the sea of neutral variants in the non-coding regions of the genome: as a consequence, the prediction of deleterious variants is a very challenging, highly imbalanced classification problem, and classical prediction tools fail to detect the rare pathogenic examples among the huge amount of neutral variants or undergo severe restrictions in managing big genomic data. To overcome these limitations we propose parSMURF, a method that adopts a hyper-ensemble approach with oversampling and undersampling techniques to deal with imbalanced data, and parallel computational techniques to both manage big genomic data and significantly speed up the computation. The synergy between Bayesian optimization techniques and the parallel nature of parSMURF enables efficient and user-friendly automatic tuning of the hyper-parameters of the algorithm, and allows specific learning problems in Genomic Medicine to be easily fit. Moreover, by using MPI parallel and machine learning ensemble techniques, parSMURF can manage big data by partitioning them across the nodes of a High Performance Computing cluster.

Results with synthetic data and with single nucleotide variants associated with Mendelian diseases and with GWAS hits in the non-coding regions of the human genome, involving millions of examples, show that parSMURF achieves state-of-the-art results and a speed-up of 80× with respect to the sequential version. In conclusion, parSMURF is a parallel machine learning tool that can be trained to learn different genomic problems, and its multiple levels of parallelization and high scalability allow problems characterized by big and imbalanced genomic data to be fit efficiently.

Availability and Implementation: The C++ OpenMP multi-core version tailored to a single workstation and the C++ MPI/OpenMP hybrid multi-core and multi-node parSMURF version tailored to a High Performance Computing cluster are both available from GitHub: https://github.com/AnacletoLAB/parSMURF
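
The core hyper-ensemble idea combines per-member oversampling of the rare class with undersampling of the abundant class. A simplified Python sketch of the pattern follows; it uses plain random oversampling in place of the SMOTE-style oversampling of the actual method and omits the MPI data partitioning:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Simplified sketch of the hyper-ensemble idea behind parSMURF (assumptions:
# random oversampling replaces SMOTE-style oversampling; no MPI partitioning).
def fit_hyper_ensemble(X, y, n_members=10, oversample_factor=2,
                       undersample_ratio=3, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]   # rare positives (e.g. pathogenic variants)
    neg = np.where(y == 0)[0]   # abundant negatives (neutral variants)
    members = []
    for _ in range(n_members):
        # Oversample the positives with replacement ...
        pos_idx = rng.choice(pos, size=len(pos) * oversample_factor, replace=True)
        # ... and undersample a matching slice of the negatives.
        n_neg = min(len(neg), len(pos_idx) * undersample_ratio)
        neg_idx = rng.choice(neg, size=n_neg, replace=False)
        idx = np.concatenate([pos_idx, neg_idx])
        clf = RandomForestClassifier(n_estimators=100,
                                     random_state=int(rng.integers(1 << 31)))
        members.append(clf.fit(X[idx], y[idx]))
    return members

def predict_proba_positive(members, X):
    # Average the positive-class probability over all ensemble members.
    return np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)
```

Because each member trains independently on its own balanced subsample, the ensemble parallelizes naturally across cores and cluster nodes, which is what the MPI/OpenMP versions exploit.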


2021 · Vol 13 (4) · pp. 94
Author(s): Haokun Fang, Quan Qian

Privacy protection has become an important concern with the great success of machine learning. This paper proposes a multi-party privacy-preserving machine learning framework, named PFMLP, based on partially homomorphic encryption and federated learning. The core idea is that all learning parties transmit only gradients encrypted with homomorphic encryption. Experiments show that the model trained by PFMLP achieves almost the same accuracy, with a deviation of less than 1%. Considering the computational overhead of homomorphic encryption, we use an improved Paillier algorithm which speeds up training by 25–28%. Moreover, the effects of the encryption key length, the learning network structure, the number of learning clients, etc. are also discussed in detail in the paper.
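
To make the encrypted-gradient exchange concrete, here is a minimal sketch using the open-source `phe` library (python-paillier). It illustrates only the general mechanism; PFMLP's improved Paillier variant and its network protocol are not reproduced:

```python
# Minimal sketch of exchanging encrypted gradients with the Paillier
# cryptosystem via the "phe" library. Illustrates the general idea only;
# PFMLP's improved Paillier variant and protocol are not reproduced here.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Each party encrypts its local gradient vector before sharing it.
grad_party_a = [0.12, -0.05, 0.33]
grad_party_b = [0.10, -0.07, 0.31]
enc_a = [public_key.encrypt(g) for g in grad_party_a]
enc_b = [public_key.encrypt(g) for g in grad_party_b]

# Paillier is additively homomorphic: ciphertexts can be summed and scaled
# by plaintext constants without decryption, so an aggregator can average
# the gradients while seeing only ciphertexts.
enc_avg = [(a + b) * 0.5 for a, b in zip(enc_a, enc_b)]

avg = [private_key.decrypt(c) for c in enc_avg]
print(avg)  # ≈ [0.11, -0.06, 0.32]
```

The aggregator never sees plaintext gradients, which is what prevents gradient-based reconstruction of any party's private training data.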


2019 · Vol 214 · pp. 07012
Author(s): Nikita Balashov, Maxim Bashashin, Pavel Goncharov, Ruslan Kuchumov, Nikolay Kutovskiy, ...

Cloud computing has become a routine tool for scientists in many fields. The JINR cloud infrastructure provides JINR users with computational resources to perform various scientific calculations. In order to speed up the achievement of scientific results, the JINR cloud service for parallel applications has been developed. It consists of several components and implements a flexible and modular architecture which allows both additional applications and various types of resources to be utilized as computational backends. An example of using the Cloud&HybriLIT resources in scientific computing is the study of superconducting processes in stacked long Josephson junctions (LJJ). LJJ systems have undergone intensive research because of prospective practical applications in nano-electronics and quantum computing. In this contribution we generalize the experience gained in applying the Cloud&HybriLIT resources to high performance computing of physical characteristics of the LJJ system.
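
For context, the dynamics of stacked long Josephson junctions are commonly modeled by a system of coupled perturbed sine-Gordon equations. A typical form is shown below (assuming the standard inductive-coupling formulation; the specific model used in the cited study may differ in detail):

```latex
% Phase dynamics \varphi_l(x,t) of junction l in a stack of N junctions:
% \alpha - dissipation coefficient, \gamma - normalized bias current,
% S_{lm} - inductive coupling matrix between neighboring junctions.
\[
  \frac{\partial^2 \varphi_l}{\partial t^2}
  + \alpha \frac{\partial \varphi_l}{\partial t}
  + \sin \varphi_l - \gamma
  = \sum_{m=1}^{N} S_{lm} \frac{\partial^2 \varphi_m}{\partial x^2},
  \qquad l = 1, \dots, N
\]
```

Solving such systems over fine space-time grids and many bias-current values is what makes cloud and hybrid HPC resources attractive for this problem.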


2009 · Vol 01 (04) · pp. 737-763
Author(s): E. MOEENDARBARY, T. Y. NG, M. ZANGENEH

The dissipative particle dynamics (DPD) technique is a relatively new mesoscale technique which was initially developed to simulate hydrodynamic behavior in mesoscopic complex fluids. It is essentially a particle technique in which molecules are clustered into the said particles, and this coarse graining is a very important aspect of DPD as it allows significant computational speed-up. This increased computational efficiency, coupled with the recent advent of high performance computing, has subsequently enabled researchers to numerically study a host of complex fluid applications at a refined level. In this review, we trace the developments of various important aspects of the DPD methodology since it was first proposed in the early 1990s. In addition, we review notable published works which employed DPD simulation for complex fluid applications.
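
As background, the force on a DPD particle is usually written as a sum of pairwise conservative, dissipative and random contributions. The standard formulation from the early DPD literature (independent of this review) is:

```latex
% Total force on particle i: conservative (C), dissipative (D), random (R).
% a_{ij}: repulsion parameter, r_c: cutoff radius, \gamma/\sigma: friction and
% noise amplitudes, \theta_{ij}: Gaussian white noise, \hat{e}_{ij} = r_{ij}/|r_{ij}|.
\[
  \mathbf{f}_i = \sum_{j \neq i}
    \left( \mathbf{F}^{C}_{ij} + \mathbf{F}^{D}_{ij} + \mathbf{F}^{R}_{ij} \right)
\]
\[
  \mathbf{F}^{C}_{ij} = a_{ij}\bigl(1 - r_{ij}/r_c\bigr)\,\hat{\mathbf{e}}_{ij},
  \quad
  \mathbf{F}^{D}_{ij} = -\gamma\, w^{D}(r_{ij})\,
      (\hat{\mathbf{e}}_{ij} \cdot \mathbf{v}_{ij})\, \hat{\mathbf{e}}_{ij},
  \quad
  \mathbf{F}^{R}_{ij} = \sigma\, w^{R}(r_{ij})\, \theta_{ij}\, \hat{\mathbf{e}}_{ij}
\]
% Fluctuation-dissipation constraint linking noise to friction:
\[
  w^{D}(r) = \bigl[w^{R}(r)\bigr]^2, \qquad \sigma^2 = 2\,\gamma\, k_B T
\]
```

The soft, short-ranged conservative force is what permits the large time steps and coarse particles responsible for DPD's computational speed-up over atomistic molecular dynamics.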


2019 · Vol 18 (4) · pp. 31-42
Author(s): Carlos Arango, Rémy Dernat, John Sanabria

Virtualization technologies have evolved along with the development of computational environments. Virtualization offered features needed at the time, such as isolation, accountability, resource allocation and fair resource sharing. Novel processor technologies bring to commodity computers the possibility to emulate diverse environments where a wide range of computational scenarios can be run. Along with the evolution of processors, developers have implemented different virtualization mechanisms exhibiting enhanced performance over previous virtualized environments. Recently, operating system-based virtualization technologies have captured the attention of the community because of their significant performance improvements. In this paper, the features of three container-based operating system virtualization tools (LXC, Docker and Singularity) are presented. LXC, Docker, Singularity and bare metal are put under test through a customized single-node HPL benchmark and an MPI-based application for the multi-node testbed. Disk I/O performance, memory (RAM) performance, network bandwidth and GPU performance are also tested for the container technologies versus bare metal. Preliminary results and conclusions around them are presented and discussed.
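
The comparison methodology amounts to running an identical benchmark under each environment and normalizing against bare metal. A hedged Python sketch of such a harness (not taken from the paper; the container image names and the `xhpl` binary are placeholders):

```python
import statistics
import subprocess
import time

# Illustrative harness (not from the paper): run the same benchmark under
# each environment and compare median wall-clock times to bare metal.
# Image names and the "xhpl" binary/working directory are placeholders.
ENVIRONMENTS = {
    "bare-metal": "mpirun -np 4 ./xhpl",
    "docker": "docker run --rm -v $PWD:/work -w /work hpl-image "
              "mpirun -np 4 ./xhpl",
    "singularity": "singularity exec hpl.sif mpirun -np 4 ./xhpl",
}

def median_runtime(cmd: str, repeats: int = 3) -> float:
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, shell=True, check=True)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

results = {name: median_runtime(cmd) for name, cmd in ENVIRONMENTS.items()}
baseline = results["bare-metal"]
for name, t in results.items():
    print(f"{name:12s} {t:8.1f} s  overhead {100 * (t / baseline - 1):+6.1f}%")
```

The same pattern extends to the other dimensions tested (disk I/O, RAM, network bandwidth, GPU) by swapping in the corresponding benchmark command.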


2019 · Vol 16 (2) · pp. 541-564
Author(s): Mathias Longo, Ana Rodriguez, Cristian Mateos, Alejandro Zunino

In-silico research has grown considerably. Today's scientific code involves long-running computer simulations, and hence powerful computing infrastructures are needed. Traditionally, research in high-performance computing has focused on executing code as fast as possible, while energy has only recently been recognized as another goal to consider. Yet, energy-driven research has mostly focused on the hardware and middleware layers, and few efforts target the application level, where many energy-aware optimizations are possible. We revisit a catalog of Java primitives commonly used in OO scientific programming, or micro-benchmarks, to identify energy-friendly versions of the same primitive. We then apply the micro-benchmarks to classical scientific application kernels and machine learning algorithms for both single-thread and multi-thread implementations on a server. Energy usage reductions at the micro-benchmark level are substantial, while the reductions obtained for applications range from 3.90% to 99.18%.
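
The micro-benchmark methodology compares alternative implementations of the same primitive under a fixed workload. A minimal, language-neutral sketch of the pattern (in Python rather than the Java used in the paper, with wall-clock time as a crude stand-in for the direct energy measurements of the study):

```python
import timeit

# Minimal sketch of the micro-benchmark pattern (Python stand-in for the
# paper's Java primitives; time here is only a proxy, whereas the study
# measured energy directly).
VARIANTS = {
    # Two ways to build one string from many pieces.
    "concat-in-loop": """
s = ""
for i in range(10_000):
    s += str(i)
""",
    "join": """
s = "".join(str(i) for i in range(10_000))
""",
}

for name, code in VARIANTS.items():
    # Best of 5 repetitions of 50 runs each, to damp scheduling noise.
    t = min(timeit.repeat(code, number=50, repeat=5))
    print(f"{name:15s} {t:.3f} s")
```

Once the more efficient variant of each primitive is identified, application kernels are rewritten to use it, which is where the reported application-level reductions come from.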


2017 · Vol 29 (3)
Author(s): Mabule Samuel Mabakane, Daniel Mojalefa Moeketsi, Anton Lopis

This paper presents a case study on the scalability of several versions of the molecular dynamics code DL_POLY performed on South Africa's Centre for High Performance Computing e1350 IBM Linux cluster, Sun system and Lengau supercomputers. Within this study, different problem sizes were designed and the same chosen systems were employed in order to test the performance of DL_POLY using weak and strong scalability. It was found that the speed-up results for the small systems were better than for the large systems on both Ethernet and InfiniBand networks. However, simulations of large systems in DL_POLY performed well using the InfiniBand network on the Lengau cluster as compared to the e1350 and Sun supercomputers.
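
For reference, the strong- and weak-scaling metrics used in such studies are conventionally defined as follows (standard definitions, not specific to this paper):

```latex
% Strong scaling: fixed total problem size, p processors, runtime T(p).
\[
  S_{\text{strong}}(p) = \frac{T(1)}{T(p)}, \qquad
  E_{\text{strong}}(p) = \frac{S_{\text{strong}}(p)}{p}
\]
% Weak scaling: problem size grows with p so work per processor stays fixed;
% ideal behavior is constant runtime, i.e. efficiency close to 1.
\[
  E_{\text{weak}}(p) = \frac{T(1)}{T(p)}
\]
```

Communication-bound runs explain the network effect reported here: the higher bandwidth and lower latency of InfiniBand matter most for the large systems, whose inter-node traffic dominates runtime.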

