A New Data Classification Algorithm for Data-Intensive Computing Environments

To address the problem of improving the scalability of data processing and the data availability that data mining techniques encounter in data-intensive computing, this paper presents a new tree learning method. By introducing MapReduce, the SPRINT-based tree learning method scales well when handling large datasets. Moreover, we define the search for a split point as a series of distributed computations, each implemented with the MapReduce model. A new data structure, called the class distribution table, is introduced to assist the calculation of histograms. Experiments and analysis of the results show that the algorithm has strong data mining capabilities in data-intensive computing environments.
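
The split-point idea in this abstract can be illustrated with a small single-process sketch. It is not the authors' implementation: the abstract does not specify the layout of the class distribution table, so the table used here (attribute value mapped to per-class counts) is an assumption, as are the function names `map_phase`, `reduce_phase`, `gini` and `best_split`. The Gini index is used because SPRINT is generally described as using it as its split criterion. The map and reduce steps are simulated with plain Python generators rather than a real MapReduce framework.

```python
from collections import Counter, defaultdict


def map_phase(records):
    """Map step (simulated): emit ((value, label), 1) per record.

    `records` is an iterable of (numeric attribute value, class label).
    """
    for value, label in records:
        yield (value, label), 1


def reduce_phase(pairs):
    """Reduce step (simulated): sum counts into a table of
    attribute value -> Counter of class labels.

    This table is a hypothetical stand-in for the paper's
    class distribution table.
    """
    table = defaultdict(Counter)
    for (value, label), count in pairs:
        table[value][label] += count
    return table


def gini(counts):
    """Gini impurity of a class histogram."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split(table):
    """Scan the sorted attribute values while maintaining 'below' and
    'above' class histograms; return (weighted Gini, split point)."""
    above = Counter()
    for hist in table.values():
        above.update(hist)
    below = Counter()
    n = sum(above.values())
    values = sorted(table)
    best = (float("inf"), None)
    for left, right in zip(values, values[1:]):
        # Move all records with value `left` from 'above' to 'below'.
        below.update(table[left])
        above.subtract(table[left])
        n_below = sum(below.values())
        score = (n_below / n) * gini(below) + ((n - n_below) / n) * gini(above)
        if score < best[0]:
            # Candidate split is the midpoint between adjacent values.
            best = (score, (left + right) / 2)
    return best
```

For example, `best_split(reduce_phase(map_phase(records)))` returns the lowest weighted Gini score and the corresponding midpoint for one numeric attribute. In a distributed setting, the map and reduce steps would run across nodes, and only the much smaller aggregated table would need to be scanned for candidate splits.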

Performance Evaluations of Distributed File Systems for Scientific Big Data in FUSE Environment

Electronics ◽

10.3390/electronics10121471 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1471

Author(s):

Jun-Yeong Lee ◽

Moon-Hyun Kim ◽

Syed Asif Raza Shah ◽

Sang-Un Ahn ◽

Heejun Yoon ◽

...

Keyword(s):

Data Storage ◽

Scale Up ◽

File Systems ◽

Performance Evaluations ◽

Distributed File Systems ◽

Data Intensive Computing ◽

Data Intensive ◽

Tremendous Amount ◽

Computing Environments ◽

And Performance

Data are important and ever growing in data-intensive scientific environments. Such research data growth requires data storage systems that play pivotal roles in data management and analysis for scientific discoveries. Redundant Array of Independent Disks (RAID), a well-known storage technology combining multiple disks into a single large logical volume, has been widely used for the purpose of data redundancy and performance improvement. However, this requires RAID-capable hardware or software to build up a RAID-enabled disk array. In addition, it is difficult to scale up the RAID-based storage. In order to mitigate such a problem, many distributed file systems have been developed and are being actively used in various environments, especially in data-intensive computing facilities, where a tremendous amount of data have to be handled. In this study, we investigated and benchmarked various distributed file systems, such as Ceph, GlusterFS, Lustre and EOS for data-intensive environments. In our experiment, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect the read and write performance depending on the features of data, which have to be considered in data-intensive computing environments.

A Comprehensive Survey on Data-Intensive Computing and MapReduce Paradigm in Cloud Computing Environments

Informatics and Communication Technologies for Societal Development ◽

10.1007/978-81-322-1916-3_9 ◽

2014 ◽

pp. 85-93

Author(s):

Girish Neelakanta Iyer ◽

Salaja Silas

Keyword(s):

Cloud Computing ◽

Data Intensive Computing ◽

Data Intensive ◽

Comprehensive Survey ◽

Computing Environments ◽

Mapreduce Paradigm

Based on the MapReduce Model for Data-intensive Computing of Energy Scheduling Algorithm Strategy

Research Journal of Applied Sciences Engineering and Technology ◽

10.19026/rjaset.5.4275 ◽

2013 ◽

Vol 5 (22) ◽

pp. 5267-5271

Author(s):

Yuqiang Sun ◽

Xin Gao ◽

Huanhuan Cai ◽

Xianmei Chang ◽

Lei Li

Keyword(s):

Scheduling Algorithm ◽

Data Intensive Computing ◽

Data Intensive ◽

Mapreduce Model

Data classification algorithm for data-intensive computing environments

EURASIP Journal on Wireless Communications and Networking ◽

10.1186/s13638-017-1002-4 ◽

2017 ◽

Vol 2017 (1) ◽

Cited By ~ 1

Author(s):

Tiedong Chen ◽

Shifeng Liu ◽

Daqing Gong ◽

Honghu Gao

Keyword(s):

Data Classification ◽

Classification Algorithm ◽

Data Intensive Computing ◽

Data Intensive ◽

Computing Environments

Distributed Storage Systems for Data Intensive Computing

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Data Intensive Distributed Computing ◽

10.4018/978-1-61520-971-2.ch004 ◽

2012 ◽

pp. 95-117

Author(s):

Sudharshan S. Vazhkudai ◽

Ali R. Butt ◽

Xiaosong Ma

Keyword(s):

Storage Systems ◽

Distributed Storage ◽

Data Availability ◽

Data Intensive Computing ◽

Application Performance ◽

Data Staging ◽

Data Intensive ◽

Distributed Storage Systems ◽

User Access ◽

Day To Day Operations

In this chapter, the authors present an overview of the utility of distributed storage systems in supporting modern applications that are increasingly becoming data intensive. Their coverage of distributed storage systems is based on the requirements imposed by data intensive computing and not a mere summary of storage systems. To this end, they delve into several aspects of supporting data-intensive analysis, such as data staging, offloading, checkpointing, and end-user access to terabytes of data, and illustrate the use of novel techniques and methodologies for realizing distributed storage systems therein. The data deluge from scientific experiments, observations, and simulations is affecting all of the aforementioned day-to-day operations in data-intensive computing. Modern distributed storage systems employ techniques that can help improve application performance, alleviate I/O bandwidth bottleneck, mask failures, and improve data availability. They present key guiding principles involved in the construction of such storage systems, associated tradeoffs, design, and architecture, all with an eye toward addressing challenges of data-intensive scientific applications. They highlight the concepts involved using several case studies of state-of-the-art storage systems that are currently available in the data-intensive computing landscape.

An Inter-framework Cache for Diverse Data-Intensive Computing Environments

2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity) ◽

10.1109/smartcity.2015.192 ◽

2015 ◽

Author(s):

Chun-Yu Wang ◽

Tzu-En Huang ◽

Yu-Tang Huang ◽

Jyh-Biau Chang ◽

Ce-Kuen Shieh

Keyword(s):

Data Intensive Computing ◽

Data Intensive ◽

Computing Environments ◽

Diverse Data

The National Scalable Cluster Project: Three Lessons about High Performance Data Mining and Data Intensive Computing

Massive Computing - Handbook of Massive Data Sets ◽

10.1007/978-1-4615-0005-6_23 ◽

2002 ◽

pp. 853-874 ◽

Cited By ~ 1

Author(s):

Robert Grossman ◽

Robert Hollebeek

Keyword(s):

Data Mining ◽

High Performance ◽

Performance Data ◽

Data Intensive Computing ◽

Data Intensive

Data analysis at scale

it - Information Technology ◽

10.1515/itit-2014-1077 ◽

2015 ◽

Vol 57 (2) ◽

Author(s):

Rainer Gemulla

Keyword(s):

Data Mining ◽

Data Analysis ◽

Natural Language ◽

Noisy Data ◽

Large Datasets ◽

Natural Language Text ◽

Data Intensive ◽

Large Complex ◽

Structured Information ◽

Language Text

My research focuses on methods to analyze and mine large datasets as well as their practical realizations and applications. The key question of interest to me is: How can we effectively and efficiently distill useful information from large, complex, and potentially noisy datasets? To approach this question, we are developing systems for scalable data analysis and data mining, for working with incomplete and noisy data, for data-intensive optimization, as well as for extracting structured information from natural-language text. This article highlights some of my work in these areas.

CyVerse for Reproducible Research: RNA-Seq Analysis

Plant Bioinformatics - Methods in Molecular Biology ◽

10.1007/978-1-0716-2067-0_3 ◽

2022 ◽

pp. 57-79

Author(s):

Jason Williams

Keyword(s):

Data Storage ◽

High Performance ◽

Lessons Learned ◽

Data Availability ◽

Reproducible Research ◽

Rna Seq ◽

Data Intensive ◽

Interactive Computing ◽

Computing Environments ◽

Performance Computing

Posing complex research questions poses complex reproducibility challenges. Datasets may need to be managed over long periods of time. Reliable and secure repositories are needed for data storage. Sharing big data requires advance planning and becomes complex when collaborators are spread across institutions and countries. Many complex analyses require the larger compute resources only provided by cloud and high-performance computing infrastructure. Finally, at publication, funder and publisher requirements must be met for data availability and accessibility and computational reproducibility. For all of these reasons, cloud-based cyberinfrastructures are an important component for satisfying the needs of data-intensive research. Learning how to incorporate these technologies into your research skill set will allow you to work with data analysis challenges that are often beyond the resources of individual research institutions. One of the advantages of CyVerse is that there are many solutions for high-powered analyses that do not require knowledge of command line (i.e., Linux) computing. In this chapter we will highlight CyVerse capabilities by analyzing RNA-Seq data. The lessons learned will translate to doing RNA-Seq in other computing environments and will focus on how CyVerse infrastructure supports reproducibility goals (e.g., metadata management, containers), team science (e.g., data sharing features), and flexible computing environments (e.g., interactive computing, scaling).

Improvement of the MapReduce Model Based on Message Middleware Oriented Data Intensive Computing

2011 Seventh International Conference on Computational Intelligence and Security ◽

10.1109/cis.2011.27 ◽

2011 ◽

Author(s):

Ge Junwei ◽

Xian Jiang ◽

Yiqiu Fang

Keyword(s):

Data Intensive Computing ◽

Data Intensive ◽

Model Based ◽

Message Middleware ◽

Mapreduce Model
