Data Intensive Computing for Bioinformatics

Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in Life Sciences. The recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large data sets coming from high throughput devices of steadily increasing power. They show results for both clustering and dimension reduction algorithms, and the use of MapReduce on modest size problems. They identify two key areas where further research is essential, and propose to develop new O(NlogN) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model combining the best features of MapReduce with those of high performance environments such as MPI.

Download Full-text

Enabling Large-Scale Biomedical Analysis in the Cloud

BioMed Research International ◽

10.1155/2013/185679 ◽

2013 ◽

Vol 2013 ◽

pp. 1-6 ◽

Cited By ~ 10

Author(s):

Ying-Chih Lin ◽

Chin-Sheng Yu ◽

Yen-Jen Lin

Keyword(s):

High Performance ◽

Large Scale ◽

Computing System ◽

Biomedical Data ◽

Data Intensive Computing ◽

Biomedical Analysis ◽

Data Intensive ◽

Large Scale Data ◽

Performance Computing ◽

Scale Data

Recent progress in high-throughput instrumentations has led to an astonishing growth in both volume and complexity of biomedical data collected from various sources. The planet-size data brings serious challenges to the storage and computing technologies. Cloud computing is an alternative to crack the nut because it gives concurrent consideration to enable storage and high-performance computing on large-scale data. This work briefly introduces the data intensive computing system and summarizes existing cloud-based resources in bioinformatics. These developments and applications would facilitate biomedical research to make the vast amount of diversification data meaningful and usable.

Download Full-text

High Performance Computational Analysis of Large-scale Proteome Data Sets to Assess Incremental Contribution to Coverage of the Human Genome

Journal of Proteome Research ◽

10.1021/pr400181q ◽

2013 ◽

Vol 12 (6) ◽

pp. 2858-2868 ◽

Cited By ~ 29

Author(s):

Nadin Neuhauser ◽

Nagarjuna Nagaraj ◽

Peter McHardy ◽

Sara Zanivan ◽

Richard Scheltema ◽

...

Keyword(s):

Human Genome ◽

High Performance ◽

Large Scale ◽

Computational Analysis ◽

Data Sets

Download Full-text

Task-based programming in COMPSs to converge from HPC to big data

The International Journal of High Performance Computing Applications ◽

10.1177/1094342017701278 ◽

2017 ◽

Vol 32 (1) ◽

pp. 45-60 ◽

Cited By ~ 11

Author(s):

Javier Conejero ◽

Sandra Corella ◽

Rosa M Badia ◽

Jesus Labarta

Keyword(s):

Big Data ◽

High Performance ◽

Programming Model ◽

Good Alternative ◽

Programming Models ◽

Suitable Model ◽

Advantages And Disadvantages ◽

Big Data Applications ◽

And Performance ◽

The Right

Task-based programming has proven to be a suitable model for high-performance computing (HPC) applications. Different implementations have been good demonstrators of this fact and have promoted the acceptance of task-based programming in the OpenMP standard. Furthermore, in recent years, Apache Spark has gained wide popularity in business and research environments as a programming model for addressing emerging big data problems. COMP Superscalar (COMPSs) is a task-based environment that tackles distributed computing (including Clouds) and is a good alternative for a task-based programming model for big data applications. This article describes why we consider that task-based programming models are a good approach for big data applications. The article includes a comparison of Spark and COMPSs in terms of architecture, programming model, and performance. It focuses on the differences that both frameworks have in structural terms, on their programmability interface, and in terms of their efficiency by means of three widely known benchmarking kernels: Wordcount, Kmeans, and Terasort. These kernels enable the evaluation of the more important functionalities of both programming models and analyze different work flows and conditions. The main results achieved from this comparison are (1) COMPSs is able to extract the inherent parallelism from the user code with minimal coding effort as opposed to Spark, which requires the existing algorithms to be adapted and rewritten by explicitly using their predefined functions, (2) it is an improvement in terms of performance when compared with Spark, and (3) COMPSs has shown to scale better than Spark in most cases. Finally, we discuss the advantages and disadvantages of both frameworks, highlighting the differences that make them unique, thereby helping to choose the right framework for each particular objective.

Download Full-text

Decision Rule Extraction for Regularized Multiple Criteria Linear Programming Model

International Journal of Data Warehousing and Mining ◽

10.4018/jdwm.2011070104 ◽

2011 ◽

Vol 7 (3) ◽

pp. 88-101 ◽

Cited By ~ 1

Author(s):

DongHong Sun ◽

Li Liu ◽

Peng Zhang ◽

Xingquan Zhu ◽

Yong Shi

Keyword(s):

Linear Programming ◽

Portfolio Management ◽

Real World ◽

Credit Card ◽

Extraction Method ◽

Large Scale ◽

Programming Model ◽

Rule Extraction ◽

Multiple Criteria ◽

Data Sets

Due to the flexibility of multi-criteria optimization, Regularized Multiple Criteria Linear Programming (RMCLP) has received attention in decision support systems. Numerous theoretical and empirical studies have demonstrated that RMCLP is effective and efficient in classifying large scale data sets. However, a possible limitation of RMCLP is poor interpretability and low comprehensibility for end users and experts. This deficiency has limited RMCLP’s use in many real-world applications where both accuracy and transparency of decision making are required, such as in Customer Relationship Management (CRM) and Credit Card Portfolio Management. In this paper, the authors present a clustering based rule extraction method to extract explainable and understandable rules from the RMCLP model. Experiments on both synthetic and real world data sets demonstrate that this rule extraction method can effectively extract explicit decision rules from RMCLP with only a small compromise in performance.

Download Full-text

High Performance Network Architectures for Data Intensive Computing

Handbook of Data Intensive Computing ◽

10.1007/978-1-4614-1415-5_1 ◽

2011 ◽

pp. 3-23 ◽

Cited By ~ 1

Author(s):

Geng Lin ◽

Eileen Liu

Keyword(s):

High Performance ◽

Network Architectures ◽

Data Intensive Computing ◽

Data Intensive

Download Full-text

AstroCatR: a mechanism and tool for efficient time series reconstruction of large-scale astronomical catalogues

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa1413 ◽

2020 ◽

Vol 496 (1) ◽

pp. 629-637

Author(s):

Ce Yu ◽

Kun Li ◽

Shanjiang Tang ◽

Chao Sun ◽

Bin Ma ◽

...

Keyword(s):

Time Series ◽

High Performance ◽

Large Scale ◽

Extrasolar Planets ◽

Time Series Data ◽

Series Data ◽

Data Sets ◽

Observation Data ◽

Data Volume ◽

And Performance

ABSTRACT Time series data of celestial objects are commonly used to study valuable and unexpected objects such as extrasolar planets and supernova in time domain astronomy. Due to the rapid growth of data volume, traditional manual methods are becoming extremely hard and infeasible for continuously analysing accumulated observation data. To meet such demands, we designed and implemented a special tool named AstroCatR that can efficiently and flexibly reconstruct time series data from large-scale astronomical catalogues. AstroCatR can load original catalogue data from Flexible Image Transport System (FITS) files or data bases, match each item to determine which object it belongs to, and finally produce time series data sets. To support the high-performance parallel processing of large-scale data sets, AstroCatR uses the extract-transform-load (ETL) pre-processing module to create sky zone files and balance the workload. The matching module uses the overlapped indexing method and an in-memory reference table to improve accuracy and performance. The output of AstroCatR can be stored in CSV files or be transformed other into formats as needed. Simultaneously, the module-based software architecture ensures the flexibility and scalability of AstroCatR. We evaluated AstroCatR with actual observation data from The three Antarctic Survey Telescopes (AST3). The experiments demonstrate that AstroCatR can efficiently and flexibly reconstruct all time series data by setting relevant parameters and configuration files. Furthermore, the tool is approximately 3× faster than methods using relational data base management systems at matching massive catalogues.

Download Full-text

HiperIO 2008: Second international workshop on high performance I/O systems and data intensive computing

2008 IEEE International Conference on Cluster Computing ◽

10.1109/clustr.2008.4663746 ◽

2008 ◽

Keyword(s):

High Performance ◽

International Workshop ◽

Data Intensive Computing ◽

Data Intensive ◽

Second International

Download Full-text

Study of parallel programming models on computer clusters with Intel MIC coprocessors

The International Journal of High Performance Computing Applications ◽

10.1177/1094342015580864 ◽

2015 ◽

Vol 31 (4) ◽

pp. 303-315 ◽

Cited By ~ 3

Author(s):

Miaoqing Huang ◽

Chenggang Lai ◽

Xuan Shi ◽

Zhijun Hao ◽

Haihang You

Keyword(s):

Parallel Programming ◽

High Performance ◽

Programming Model ◽

Fixed Number ◽

Parallel Applications ◽

Programming Models ◽

Communication Overhead ◽

Computer Clusters ◽

Parallel Programming Models ◽

Intel Mic

Coprocessors based on the Intel Many Integrated Core (MIC) Architecture have been adopted in many high-performance computer clusters. Typical parallel programming models, such as MPI and OpenMP, are supported on MIC processors to achieve the parallelism. In this work, we conduct a detailed study on the performance and scalability of the MIC processors under different programming models using the Beacon computer cluster. Our findings are as follows. (1) The native MPI programming model on the MIC processors is typically better than the offload programming model, which offloads the workload to MIC cores using OpenMP. (2) On top of the native MPI programming model, multithreading inside each MPI process can further improve the performance for parallel applications on computer clusters with MIC coprocessors. (3) Given a fixed number of MPI processes, it is a good strategy to schedule these MPI processes to as few MIC processors as possible to reduce the cross-processor communication overhead. (4) The hybrid MPI programming model, in which data processing is distributed to both MIC cores and CPU cores, can outperform the native MPI programming model.

Download Full-text

Cloud Computing for Rigorous Coupled-Wave Analysis

Advances in Optical Technologies ◽

10.1155/2012/398341 ◽

2012 ◽

Vol 2012 ◽

pp. 1-7 ◽

Cited By ~ 6

Author(s):

N. L. Kazanskiy ◽

P. G. Serafimovich

Keyword(s):

Cloud Computing ◽

High Performance ◽

Parallel Applications ◽

Data Sets ◽

Wave Analysis ◽

Data Intensive ◽

Rigorous Coupled Wave Analysis ◽

Rigorous Coupled Wave ◽

Effective Use ◽

Coupled Wave

Design and analysis of complex nanophotonic and nanoelectronic structures require significant computing resources. Cloud computing infrastructure allows distributed parallel applications to achieve greater scalability and fault tolerance. The problems of effective use of high-performance computing systems for modeling and simulation of subwavelength diffraction gratings are considered. Rigorous coupled-wave analysis (RCWA) is adapted to cloud computing environment. In order to accomplish this, data flow of the RCWA is analyzed and CPU-intensive operations are converted to data-intensive operations. The generated data sets are structured in accordance with the requirements of MapReduce technology.

Download Full-text