Parallel and Distributed Data Mining through Parallel Skeletons and Distributed Objects

Data Mining ◽  
2011 ◽  
pp. 106-141
Author(s):  
Massimo Coppola ◽  
Marco Vanneschi

We consider the application of parallel programming environments to develop portable and efficient high-performance data mining (DM) tools. We first assess the need for parallel and distributed DM applications, pointing out the scalability problems of some mining techniques and the need to mine large, possibly geographically distributed databases. We discuss the main issues in exploiting parallel and distributed computation for DM algorithms. A high-level programming language enhances the software engineering aspects of parallel DM and simplifies integration with existing sequential and parallel data management systems, leading to programming-efficient and high-performance implementations of applications. We describe a programming environment we have implemented that is based on the parallel skeleton model, and we examine the addition of object-like interfaces toward external libraries and system software layers. This kind of abstraction will be included in the forthcoming programming environment ASSIST. In the main part of the chapter, as a proof of concept, we describe three well-known DM algorithms: Apriori, C4.5, and DBSCAN. For each problem, we explain the sequential algorithm and a structured parallel version, which is discussed and compared to parallel solutions found in the literature. We also discuss the potential gains in performance and expressiveness from the addition of external objects, on the basis of the experiments we have performed so far. We evaluate the approach with respect to performance results, design, and implementation considerations.
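
To make the data-parallel structure concrete, the sketch below shows one pass of a partitioned Apriori-style frequent itemset count: candidates are counted locally on each database partition, and the counts are then reduced globally. This is a minimal illustration of the general pattern, not the chapter's skeleton-based code; the function names and the toy transactions are assumptions.

```python
from itertools import combinations
from collections import Counter

def local_counts(partition, candidates):
    """Count candidate itemsets in one database partition (local phase)."""
    counts = Counter()
    for transaction in partition:
        t = set(transaction)
        for c in candidates:
            if set(c) <= t:
                counts[c] += 1
    return counts

def parallel_apriori_pass(partitions, candidates, min_support):
    """One Apriori pass: map local counts over partitions, reduce, filter."""
    total = Counter()
    for part in partitions:          # conceptually executed in parallel
        total.update(local_counts(part, candidates))
    n = sum(len(p) for p in partitions)
    return {c for c, cnt in total.items() if cnt / n >= min_support}

# Toy example (assumed data): two partitions of a transaction database.
partitions = [[("a", "b"), ("a", "c")], [("a", "b", "c"), ("b", "c")]]
candidates = [tuple(sorted(c)) for c in combinations("abc", 2)]
print(parallel_apriori_pass(partitions, candidates, min_support=0.5))
```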

Author(s):  
Antonio Congiusta ◽  
Domenico Talia ◽  
Paolo Trunfio

Knowledge discovery is a compute- and data-intensive process that finds patterns, trends, and models in large datasets. The Grid can be effectively exploited for deploying knowledge discovery applications because of the high performance it can offer and its distributed infrastructure. For effective use of Grids in knowledge discovery, the development of middleware supporting data management, data transfer, data mining, and knowledge representation is critical. To this end, we designed the Knowledge Grid, a high-level environment providing Grid-based knowledge discovery tools and services. These services allow users to create and manage complex knowledge discovery applications, composed as workflows that integrate data sources and data mining tools provided as distributed Grid services. This chapter presents the Knowledge Grid architecture and describes how its components can be used to design and implement distributed knowledge discovery applications. It then describes how the Knowledge Grid services can be made accessible using the Open Grid Services Architecture (OGSA) model.
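
As a rough illustration of workflow-based composition, the sketch below models a knowledge discovery application as a small DAG of dependent tasks executed in dependency order. The task names and the in-process scheduler are assumptions for illustration only and do not reflect the Knowledge Grid's actual service interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    action: callable
    depends_on: list = field(default_factory=list)

def run(tasks):
    """Execute tasks in dependency order (a stand-in for Grid scheduling)."""
    done, results = set(), {}
    while len(done) < len(tasks):
        for t in tasks:
            if t.name not in done and all(d in done for d in t.depends_on):
                results[t.name] = t.action(results)
                done.add(t.name)
    return results

# Example workflow: fetch two distributed data sources, mine each, merge models.
tasks = [
    Task("src_a", lambda r: [1, 2, 3]),
    Task("src_b", lambda r: [4, 5, 6]),
    Task("mine_a", lambda r: sum(r["src_a"]), ["src_a"]),
    Task("mine_b", lambda r: sum(r["src_b"]), ["src_b"]),
    Task("merge", lambda r: r["mine_a"] + r["mine_b"], ["mine_a", "mine_b"]),
]
print(run(tasks)["merge"])
```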


1997 ◽  
Vol 6 (2) ◽  
pp. 215-227 ◽  
Author(s):  
Guy Edjlali ◽  
Gagan Agrawal ◽  
Alan Sussman ◽  
Jim Humphries ◽  
Joel Saltz

For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at run-time. In this article, we discuss run-time support for data-parallel programming in such an adaptive environment. Executing programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. We have developed a run-time library to provide this support. We discuss how the run-time library can be used by compilers of High Performance Fortran (HPF)-like languages to generate code for an adaptive environment. We present performance results for a Navier-Stokes solver and a multigrid template run on a network of workstations and an IBM SP-2. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant compared to the time required for the actual computation. Overall, our work establishes the feasibility of compiling HPF for a network of nondedicated workstations, which are likely to be an important resource for parallel programming in the future.
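
A minimal sketch of the idea, assuming a simple block distribution: when the processor count changes, each processor's loop bounds and data block are recomputed from the new count. This stands in for the paper's run-time library and is not its actual interface.

```python
def block_bounds(n, nprocs, rank):
    """Lower/upper bounds of a block distribution of n iterations over nprocs."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

def redistribute(global_data, new_nprocs):
    """Recompute each processor's data block when the processor count changes."""
    n = len(global_data)
    return [global_data[slice(*block_bounds(n, new_nprocs, r))]
            for r in range(new_nprocs)]

data = list(range(10))
print([block_bounds(10, 4, r) for r in range(4)])   # loop bounds on 4 processors
print(redistribute(data, 3))                        # blocks after shrinking to 3
```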


2014 ◽  
Vol 556-562 ◽  
pp. 3949-3951
Author(s):  
Jian Xin Zhu

Data mining is a technique that aims to analyze and understand large source data and to reveal the knowledge hidden in it. It has been viewed as an important evolution in information processing. It has attracted growing attention from researchers and practitioners because of the wide availability of huge amounts of data and the pressing need to turn such data into valuable information. Over the past decade and more, data mining concepts and techniques have been developed, and some of them have been discussed at a higher level in recent years. Data mining involves an integration of techniques from databases, artificial intelligence, machine learning, statistics, knowledge engineering, object-oriented methods, information retrieval, high-performance computing, and visualization. Essentially, data mining is a high-level analysis technology with a strong orientation toward business profit. Unlike OLTP applications, data mining should provide in-depth data analysis and support for business decisions.


2010 ◽  
Vol 34-35 ◽  
pp. 1961-1965
Author(s):  
You Qu Chang ◽  
Guo Ping Hou ◽  
Huai Yong Deng

Distributed data mining is widely used in industrial and commercial applications to analyze large datasets maintained over geographically distributed sites. This paper discusses the disadvantages of existing distributed data mining systems and puts forward a distributed data mining platform based on grid computing. Experiments on a data set showed that the proposed approach produces meaningful results and achieves reasonable efficiency and effectiveness, providing a trade-off between runtime and rule interestingness.


Author(s):  
K. Ganesh Kumar ◽  
H. Vignesh Ramamoorthy ◽  
M. Prem Kumar ◽  
S. Sudha

Association rule mining (ARM) discovers correlations between different itemsets in a transaction database. It provides important knowledge for business decision makers. Association rule mining is an active data mining research area, and most ARM algorithms cater to a centralized environment. Centralized data mining to discover useful patterns in distributed databases is not always feasible, because merging data sets from different sites incurs huge network communication costs. In this paper, an improved algorithm with good performance for distributed data mining is proposed. At local sites, it runs the application based on the improved LMatrix algorithm, which is used to calculate local support counts. The local sites also elect a center site to manage every message exchanged in order to obtain all globally frequent itemsets. Using the LMatrix also reduces the time needed to scan the partitioned database, which increases the performance of the algorithm. The aim of the research is therefore to develop a distributed algorithm for geographically distributed data sets that achieves lower communication costs, better running efficiency, and stronger scalability than the direct application of a sequential algorithm to distributed databases.
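
The local-count/global-aggregation pattern described above can be sketched as follows: each site counts candidate itemsets over its own partition and sends only the counts (not the raw transactions) to a center site, which sums them and keeps the globally frequent itemsets. This illustrates the communication pattern only, not the LMatrix data structure itself; the function names and toy data are assumptions.

```python
from collections import Counter

def local_support_counts(local_db, candidates):
    """Local phase: each site counts candidate itemsets in its own partition."""
    counts = Counter()
    for transaction in local_db:
        for c in candidates:
            if set(c) <= set(transaction):
                counts[c] += 1
    return counts

def center_site_aggregate(site_counts, total_transactions, min_support):
    """Center site: sum the local counts and keep globally frequent itemsets."""
    total = Counter()
    for counts in site_counts:       # only counts cross the network, not data
        total.update(counts)
    return {c for c, n in total.items() if n / total_transactions >= min_support}

sites = [[("a", "b"), ("b", "c")], [("a", "b"), ("a", "c")]]   # assumed toy data
candidates = [("a", "b"), ("a", "c"), ("b", "c")]
counts = [local_support_counts(db, candidates) for db in sites]
print(center_site_aggregate(counts, sum(len(s) for s in sites), 0.5))
```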


2020 ◽  
Vol 38 (3-4) ◽  
pp. 1-31
Author(s):  
Won Wook Song ◽  
Youngseok Yang ◽  
Jeongyoon Eo ◽  
Jangho Seo ◽  
Joo Yeon Kim ◽  
...  

Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project.
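
To illustrate the notion of an optimization pass over a dataflow intermediate representation, the toy sketch below rewrites an execution property (where intermediate data is stored) on selected edges of a small DAG. It mirrors the concept only; the class and pass names are assumptions, and this is not Apache Nemo's actual Java API.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    """One edge of a toy dataflow IR, carrying rewritable execution properties."""
    src: str
    dst: str
    props: dict = field(default_factory=lambda: {"data_store": "memory"})

def disk_spill_pass(edges, large_vertices):
    """Pass: route outputs of 'large' vertices through disk instead of memory."""
    for e in edges:
        if e.src in large_vertices:
            e.props["data_store"] = "local_disk"
    return edges

edges = [Edge("map", "shuffle"), Edge("shuffle", "reduce")]
for e in disk_spill_pass(edges, large_vertices={"shuffle"}):
    print(e.src, "->", e.dst, e.props)
```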


Author(s):  
Fhira Nhita

Data mining is a combination of technologies for extracting useful information from a dataset using techniques such as classification and clustering. Clustering is one of the most widely used data mining techniques today. K-Means and K-Medoids are among the most commonly used clustering algorithms because they are easy to implement, efficient, and produce good results. Besides mining important information, the time spent mining data is also a concern in the current era, since real-world applications produce huge volumes of data. This research analyzed the results of the K-Means and K-Medoids algorithms and their time performance, using a High Performance Computing (HPC) cluster and the Message Passing Interface (MPI) library to parallelize both algorithms. The results show that the K-Means algorithm gives a smaller SSE than K-Medoids, and that the parallel MPI implementation gives faster computation time than the sequential algorithm.
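
A minimal sketch of the data decomposition behind parallel K-Means: each partition (conceptually one MPI rank) computes local per-cluster sums and counts, and a global reduction (the step MPI_Allreduce would perform) yields the new centroids. This is an assumed illustration in plain Python with made-up data, not the paper's HPC/MPI implementation.

```python
import random

def assign(points, centroids):
    """Assign each point to its nearest centroid (local phase on one rank)."""
    def nearest(p):
        return min(range(len(centroids)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
    return [nearest(p) for p in points]

def kmeans_step(partitions, centroids):
    """One parallel K-Means iteration: local sums/counts, then a global reduce."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for points in partitions:                 # conceptually one partition per rank
        for p, label in zip(points, assign(points, centroids)):
            counts[label] += 1
            sums[label] = [s + x for s, x in zip(sums[label], p)]
    return [[s / max(c, 1) for s in sums[j]] for j, c in enumerate(counts)]

random.seed(0)
parts = [[(random.random(), random.random()) for _ in range(50)] for _ in range(4)]
centroids = [(0.2, 0.2), (0.8, 0.8)]
for _ in range(5):
    centroids = kmeans_step(parts, centroids)
print(centroids)
```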

