clubber: removing the bioinformatics bottleneck in big data analyses

2017 ◽  
Vol 14 (2) ◽  
Author(s):  
Maximilian Miller ◽  
Chengsheng Zhu ◽  
Yana Bromberg

Abstract: With the advent of modern day high-throughput technologies, the bottleneck in biological discovery has shifted from the cost of doing experiments to that of analyzing results. clubber is our automated cluster-load balancing system developed for optimizing these “big data” analyses. Its plug-and-play framework encourages re-use of existing solutions for bioinformatics problems. clubber’s goals are to reduce computation times and to facilitate use of cluster computing. The first goal is achieved by automating the balance of parallel submissions across available high performance computing (HPC) resources. Notably, the latter can be added on demand, including cloud-based resources, and/or featuring heterogeneous environments. The second goal of making HPCs user-friendly is facilitated by an interactive web interface and a RESTful API, allowing for job monitoring and result retrieval. We used clubber to speed up our pipeline for annotating molecular functionality of metagenomes. Here, we analyzed the Deepwater Horizon oil-spill study data to quantitatively show that the beach sands have not yet entirely recovered. Further, our analysis of the CAMI-challenge data revealed that microbiome taxonomic shifts do not necessarily correlate with functional shifts. These examples (21 metagenomes processed in 172 min) clearly illustrate the importance of clubber in the everyday computational biology environment.
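A RESTful API such as clubber's lends itself to simple scripted job control. The sketch below is illustrative only: the endpoint paths, parameters, and JSON fields are assumptions for the example, not clubber's documented API.

```python
import time
import requests

# Hypothetical clubber-style REST endpoints; the real API may differ.
BASE = "https://clubber.example.org/api"

def submit_job(fasta_path):
    """Submit a sequence file for annotation and return a job ID."""
    with open(fasta_path, "rb") as fh:
        resp = requests.post(f"{BASE}/jobs", files={"input": fh})
    resp.raise_for_status()
    return resp.json()["job_id"]          # assumed response field

def wait_for_result(job_id, poll_seconds=30):
    """Poll job status until the cluster reports completion."""
    while True:
        status = requests.get(f"{BASE}/jobs/{job_id}").json()["status"]
        if status == "done":
            return requests.get(f"{BASE}/jobs/{job_id}/result").json()
        time.sleep(poll_seconds)

result = wait_for_result(submit_job("metagenome.fasta"))
```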

Author(s):  
Yao Wu ◽  
Long Zheng ◽  
Brian Heilig ◽  
Guang R Gao

As attention to big data grows, cluster computing systems for distributed processing of large data sets have become a mainstream and critical requirement in high performance distributed systems research. One of the most successful systems is Hadoop, which uses MapReduce as its programming/execution model and stages intermediate data on disk to process huge volumes of data. Spark, an in-memory computing engine, solves iterative and interactive problems more efficiently. However, there is now a consensus that neither is the final answer to big data, owing to the MapReduce-like programming model, the synchronous execution model, the restriction to batch processing, and so on. A new solution, indeed a fundamental evolution, is needed to bring big data solutions into a new era. In this paper, we introduce a new cluster computing system called HAMR, which supports both batch and streaming processing. To achieve better performance, HAMR integrates high performance computing approaches, i.e., dataflow fundamentals, into a big data solution. More specifically, HAMR is designed entirely around in-memory computing to avoid unnecessary disk-access overhead; task scheduling and memory management are fine-grained to expose more parallelism; and asynchronous execution improves the efficiency of computational resource usage while better balancing the workload across the whole cluster. The experimental results show that HAMR can outperform Hadoop MapReduce and Spark by up to 19x and 7x, respectively, in the same cluster environment. Furthermore, HAMR can handle data sizes scaling well beyond the capabilities of Spark.
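The contrast between synchronous, barrier-style execution and an asynchronous, fine-grained model like HAMR's can be sketched in a few lines. The example below is a generic illustration using Python's standard library, not HAMR's actual scheduler.

```python
import time
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def task(i):
    """A stand-in for one fine-grained unit of work with skewed runtimes."""
    time.sleep(random.uniform(0.1, 1.0))
    return i * i

work = range(32)

# Synchronous (MapReduce-like): the map stage completes in full before the
# reduce begins, so every worker waits for the slowest straggler.
with ThreadPoolExecutor(max_workers=8) as pool:
    mapped = list(pool.map(task, work))   # implicit barrier here
    total_sync = sum(mapped)

# Asynchronous (dataflow-like): each result is folded into the reduction
# as soon as it arrives, so fast tasks never wait for slow ones.
total_async = 0
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(task, i) for i in work]
    for fut in as_completed(futures):
        total_async += fut.result()       # consume results out of order
```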


Author(s):  
Ouidad Achahbar ◽  
Mohamed Riduan Abid

The ongoing pervasiveness of Internet access is sharply increasing Big Data production. This, in turn, increases the demand for compute power to process the massive data, making High Performance Computing (HPC) a highly solicited service. Following the paradigm of computing as a utility, the Cloud offers user-friendly infrastructures for processing Big Data, e.g., High Performance Computing as a Service (HPCaaS). Still, HPCaaS performance is tightly coupled to the underlying virtualization technique, since the latter creates the virtual machines that carry out data processing jobs. In this paper, the authors evaluate the impact of virtualization on HPCaaS. They track HPC performance under different Cloud virtualization platforms, namely KVM and VMware-ESXi, and compare it against physical clusters. Each tested cluster exhibited different performance trends, yet the overall analysis showed that the choice of virtualization technology can lead to significant improvements when delivering HPCaaS.
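A study of this kind boils down to running the same workload on each platform and comparing wall-clock times. The harness below is a generic sketch of that methodology, assuming a benchmark command such as an MPI job is available on every cluster; it is not the authors' actual test setup.

```python
import statistics
import subprocess
import time

# Hypothetical benchmark command (e.g., an MPI job); swap in the real workload.
BENCHMARK = ["mpirun", "-np", "16", "./hpc_benchmark"]

def time_runs(cmd, repeats=5):
    """Run the workload several times; return mean and stdev of wall time."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Execute the same harness on the KVM, VMware-ESXi, and physical clusters,
# then compare the reported means to expose virtualization overhead.
mean_s, stdev_s = time_runs(BENCHMARK)
print(f"wall time: {mean_s:.2f}s +/- {stdev_s:.2f}s")
```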


2018 ◽  
Author(s):  
Chengjie Chen ◽  
Hao Chen ◽  
Yi Zhang ◽  
Hannah R. Thomas ◽  
Margaret H. Frank ◽  
...  

Abstract: The rapid development of high-throughput sequencing (HTS) techniques has led biology into the big-data era. Data analyses using various bioinformatics tools rely on programming and command-line environments, which are challenging and time-consuming for most wet-lab biologists. Here, we present TBtools (a Toolkit for Biologists integrating various biological data-handling tools), a stand-alone software package with a user-friendly interface. The toolkit incorporates over 100 functions designed to meet the growing demand for big-data analyses, ranging from bulk sequence processing to interactive data visualization. A wide variety of graphs can be prepared in TBtools with a new plotting engine (“JIGplot”), developed to maximize their interactivity; it allows quick point-and-click modification of almost every graphic feature. TBtools is platform-independent and runs under any operating system with Java Runtime Environment 1.6 or newer. It is freely available to non-commercial users at https://github.com/CJ-Chen/TBtools/releases.
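Bulk sequence processing of the kind such toolkits wrap behind a point-and-click interface is easy to picture. The snippet below is a generic illustration of one such task (sequence lengths and GC content from a FASTA file), not TBtools code; the input filename is made up.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

# Report length and GC fraction for every record.
for name, seq in read_fasta("transcripts.fasta"):   # hypothetical input file
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    print(f"{name}\t{len(seq)}\t{gc:.3f}")
```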


Author(s):  
Ranjit Rajak

Computer technologies have developed rapidly in both software and hardware. Software complexity keeps rising with market demand as manual systems are automated, while hardware costs keep falling. High Performance Computing (HPC) is a much-demanded technology and an attractive area of computing, owing to the huge data-processing needs of many applications. This paper focuses on the applications of HPC and on its main types: Cluster Computing, Grid Computing, and Cloud Computing. It also reviews the classifications and applications of each type, all of which are active areas of computer science research. Finally, the paper presents a comparative study of grid, cloud, and cluster computing in terms of benefits, drawbacks, key research areas, characteristics, issues, and challenges.


2017 ◽  
Vol 2017 ◽  
pp. 1-8 ◽  
Author(s):  
Qiang Lan ◽  
Zelong Wang ◽  
Mei Wen ◽  
Chunyuan Zhang ◽  
Yijie Wang

Convolutional neural networks have proven highly successful in applications such as image classification, object tracking, and many other tasks based on 2D inputs. Recently, researchers have started to apply convolutional neural networks to video classification, which constitutes a 3D input and requires far larger amounts of memory and much more computation. FFT-based methods can reduce the amount of computation, but this generally comes at the cost of an increased memory requirement. The Winograd Minimal Filtering Algorithm (WMFA), on the other hand, reduces the number of operations required, and thus speeds up the computation, without increasing the required memory. This strategy was shown to be successful for 2D neural networks. We implement the algorithm for 3D convolutional neural networks, apply it to a popular 3D convolutional neural network used for video classification, and compare against cuDNN. For our highly optimized implementation of the algorithm, we observe a twofold speedup for most of the 3D convolution layers of our test network compared to the cuDNN version.
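The saving WMFA offers is easiest to see in the classic 1D case F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 a direct computation needs; the same transform nests to 2D and 3D tiles. A minimal sketch of that base case:

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter in 4 multiplications.

    d: four consecutive input values, g: the three filter taps.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2   # filter transforms are reusable
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2   # across all input tiles
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

# Check against direct convolution (6 multiplications).
d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
direct = [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
          d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]
assert winograd_f23(d, g) == direct
```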


Sensors ◽  
2021 ◽  
Vol 21 (13) ◽  
pp. 4321
Author(s):  
Paola Soto ◽  
Miguel Camelo ◽  
Kevin Mets ◽  
Francesc Wilhelmi ◽  
David Góez ◽  
...  

IEEE 802.11 (Wi-Fi) is one of the technologies that provides high performance with a high density of connected devices, supporting emerging demanding services such as virtual and augmented reality. However, in highly dense deployments, Wi-Fi performance is severely affected by interference. This problem is even worse in newer standards, such as 802.11n/ac, where features such as Channel Bonding (CB) are introduced to increase network capacity, but at the cost of using wider spectrum channels. Finding the best channel assignment in dense deployments under dynamic environments with CB is challenging, given its combinatorial nature. Therefore, using analytical or system models to predict Wi-Fi performance after potential changes (e.g., dynamic channel selection with CB, or the deployment of new devices) is not suitable, due to either low accuracy or high computational cost. This paper presents a novel, data-driven approach to speed up this process, using a Graph Neural Network (GNN) model that exploits the information carried in the deployment’s topology and the intricate wireless interactions to predict Wi-Fi performance with high accuracy. The evaluation results show that preserving the graph structure in the learning process yields a 64% improvement over a naive approach, and around 55% over other Machine Learning (ML) approaches, when using all training features.
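At its core, a GNN layer for this task aggregates features from interfering neighbors (e.g., APs within carrier-sense range) into each node's representation. The sketch below is a generic mean-aggregation message-passing layer in NumPy; it illustrates the mechanism rather than the authors' architecture, and the toy graph, features, and weight shapes are made up for the example.

```python
import numpy as np

def gnn_layer(adj, feats, w_self, w_neigh):
    """One message-passing step: combine each node's features with the
    mean of its neighbors' features, then apply a ReLU nonlinearity.

    adj:   (n, n) 0/1 adjacency matrix (edges = interfering AP pairs)
    feats: (n, d) node features (e.g., channel width, traffic load)
    """
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1)   # avoid divide-by-zero
    neigh_mean = (adj @ feats) / deg
    return np.maximum(feats @ w_self + neigh_mean @ w_neigh, 0.0)

rng = np.random.default_rng(0)
n, d_in, d_hid = 5, 4, 8                    # toy deployment: 5 APs, 4 features
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 1, 0, 0],
                [1, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]], dtype=float)
feats = rng.normal(size=(n, d_in))
h = gnn_layer(adj, feats, rng.normal(size=(d_in, d_hid)),
              rng.normal(size=(d_in, d_hid)))
# Stacking such layers plus a readout yields a per-node performance prediction.
```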


CCIT Journal ◽  
2017 ◽  
Vol 10 (2) ◽  
pp. 230-238
Author(s):  
Martono Martono ◽  
Kartika Kartika ◽  
Putri Aullia

Technological advances in the world of information are arriving very fast, and organizations must stay a step ahead and keep up with these developments; we must strive to innovate in an era of rapidly growing competition. Given all the advantages computers offer, it is very important for companies, government agencies, and the private sector that want to grow to use technology to improve performance and to obtain optimal results from data processing. Information that is delivered quickly speeds up decision making, saving cost, time, and effort. In the village of Karang Anyar, the family card registration system is inaccurate and data reporting is delayed. The main problem with family card registration is that many residents and applicants still do not understand the procedure that family card holders must follow, and cards can only be processed at the village office. The Family Card Data Collection Application presented here is a web-based application program used to display family card data for Karang Anyar village; it also helps citizens obtain family cards quickly and accurately.


2020 ◽  
Vol 51 (1) ◽  
pp. 151-174
Author(s):  
Chung Joo Chung ◽  
Yunna Rhee ◽  
Heewon Cha
