Towards Transparent Throughput Elasticity for IaaS Cloud Storage

Storage elasticity on the cloud is a crucial feature in the age of data-intensive computing, especially when considering fluctuations of I/O throughput. In this chapter, the authors explore how to transparently boost the I/O bandwidth during peak utilization to deliver high performance without over-provisioning storage resources. The proposal relies on the idea of leveraging short-lived virtual disks of better performance characteristics (and more expensive) to act during peaks as a caching layer for the persistent virtual disks where the application data is stored during runtime. They show how this idea can be achieved efficiently at the block-device level, using a caching mechanism that leverages iterative behavior and learns from past experience. Second, they introduce a corresponding performance and cost prediction methodology. They demonstrate the benefits of our proposal both for micro-benchmarks and for two real-life applications using large-scale experiments. They conclude with a discussion on how these techniques can be generalized for increasingly complex landscape of modern cloud storage.

Download Full-text

Enabling Large-Scale Biomedical Analysis in the Cloud

BioMed Research International ◽

10.1155/2013/185679 ◽

2013 ◽

Vol 2013 ◽

pp. 1-6 ◽

Cited By ~ 10

Author(s):

Ying-Chih Lin ◽

Chin-Sheng Yu ◽

Yen-Jen Lin

Keyword(s):

High Performance ◽

Large Scale ◽

Computing System ◽

Biomedical Data ◽

Data Intensive Computing ◽

Biomedical Analysis ◽

Data Intensive ◽

Large Scale Data ◽

Performance Computing ◽

Scale Data

Recent progress in high-throughput instrumentations has led to an astonishing growth in both volume and complexity of biomedical data collected from various sources. The planet-size data brings serious challenges to the storage and computing technologies. Cloud computing is an alternative to crack the nut because it gives concurrent consideration to enable storage and high-performance computing on large-scale data. This work briefly introduces the data intensive computing system and summarizes existing cloud-based resources in bioinformatics. These developments and applications would facilitate biomedical research to make the vast amount of diversification data meaningful and usable.

Download Full-text

Large-Scale Data-Intensive Computing

Large-Scale Computing ◽

10.1002/9781118130506.ch7 ◽

2012 ◽

pp. 131-140

Author(s):

Mark Parsons

Keyword(s):

Large Scale ◽

Data Intensive Computing ◽

Data Intensive ◽

Large Scale Data ◽

Scale Data

Download Full-text

The Large Scale Data Facility: Data Intensive Computing for Scientific Experiments

2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum ◽

10.1109/ipdps.2011.286 ◽

2011 ◽

Cited By ~ 9

Author(s):

Ariel O. Garcia ◽

Serguei Bourov ◽

Ahmad Hammad ◽

Jos van Wezel ◽

Bernhard Neumair ◽

...

Keyword(s):

Large Scale ◽

Data Intensive Computing ◽

Data Intensive ◽

Large Scale Data ◽

Scientific Experiments ◽

Scale Data

Download Full-text

Data Intensive Computing for Bioinformatics

Bioinformatics ◽

10.4018/978-1-4666-3604-0.ch016 ◽

2013 ◽

pp. 287-321

Author(s):

Judy Qiu ◽

Jaliya Ekanayake ◽

Thilina Gunarathne ◽

Jong Youl Choi ◽

Seung-Hee Bae ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Programming Model ◽

Life Sciences ◽

Programming Models ◽

Data Sets ◽

Data Intensive Computing ◽

Data Intensive ◽

Data Mining Algorithms ◽

Model Combining

Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in Life Sciences. The recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large data sets coming from high throughput devices of steadily increasing power. They show results for both clustering and dimension reduction algorithms, and the use of MapReduce on modest size problems. They identify two key areas where further research is essential, and propose to develop new O(NlogN) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model combining the best features of MapReduce with those of high performance environments such as MPI.

Download Full-text

Distributed data provenance for large-scale data-intensive computing

2013 IEEE International Conference on Cluster Computing (CLUSTER) ◽

10.1109/cluster.2013.6702685 ◽

2013 ◽

Cited By ~ 28

Author(s):

Dongfang Zhao ◽

Chen Shou ◽

Tanu Maliky ◽

Ioan Raicu

Keyword(s):

Large Scale ◽

Data Provenance ◽

Distributed Data ◽

Data Intensive Computing ◽

Data Intensive ◽

Large Scale Data ◽

Scale Data

Download Full-text

Passive Network Performance Estimation for Large-Scale, Data-Intensive Computing

IEEE Transactions on Parallel and Distributed Systems ◽

10.1109/tpds.2010.201 ◽

2011 ◽

Vol 22 (8) ◽

pp. 1365-1373 ◽

Cited By ~ 3

Author(s):

Jinoh Kim ◽

Abhishek Chandra ◽

Jon B. Weissman

Keyword(s):

Large Scale ◽

Network Performance ◽

Performance Estimation ◽

Data Intensive Computing ◽

Data Intensive ◽

Large Scale Data ◽

Passive Network ◽

Scale Data

Download Full-text

Data Intensive Computing for Bioinformatics

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Data Intensive Distributed Computing ◽

10.4018/978-1-61520-971-2.ch010 ◽

2012 ◽

pp. 207-241 ◽

Cited By ~ 1

Author(s):

Judy Qiu ◽

Jaliya Ekanayake ◽

Thilina Gunarathne ◽

Jong Youl Choi ◽

Seung-Hee Bae ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Programming Model ◽

Life Sciences ◽

Programming Models ◽

Data Sets ◽

Data Intensive Computing ◽

Data Intensive ◽

Data Mining Algorithms ◽

Model Combining

Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in Life Sciences. The recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large data sets coming from high throughput devices of steadily increasing power. They show results for both clustering and dimension reduction algorithms, and the use of MapReduce on modest size problems. They identify two key areas where further research is essential, and propose to develop new O(NlogN) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model combining the best features of MapReduce with those of high performance environments such as MPI.

Download Full-text

Large-Scale Text Clustering Based on Improved K-Means Algorithm in the Storm Platform

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.543-547.1913 ◽

2014 ◽

Vol 543-547 ◽

pp. 1913-1916

Author(s):

Sheng Hang Wu ◽

Zhe Wang ◽

Ming Yuan He ◽

Huai Lin Dong

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Distributed Processing ◽

Text Clustering ◽

Research Field ◽

Data Intensive Computing ◽

Operation Mechanism ◽

Data Intensive ◽

And Performance ◽

Algorithm Base

With the web information dramatically increases, Distributed processing of mass data through a cluster have been the focus of research field. An efficient distributed algorithm is the determinant of the scalability and performance in data analyses. This dissertation firstly studies the operation mechanism of Storm, which is a simplified distributed and real-time computation platform. Based on the Storm platform, an improved K-Means algorithm which could be used for data intensive computing is designed and implemented. Finally, the experience results show that the K-Means clustering algorithm base on Storm platform could obtain a higher performance in experience and improve the effectiveness and accuracy in large-scale text clustering.

Download Full-text

The bounds of the distributed data-intensive computing systems

Pollack Periodica ◽

10.1556/pollack.2.2007.s.8 ◽

2007 ◽

Vol 2 (Supplement 1) ◽

pp. 85-96 ◽

Cited By ~ 1

Author(s):

Antal Buza

Keyword(s):

Distributed Data ◽

Data Intensive Computing ◽

Computing Systems ◽

Data Intensive

Download Full-text