Distributed data mining in grid computing environments

doi:10.1016/j.future.2006.04.010

Future Generation Computer Systems

Volume 23, Issue 1, 1 January 2007, Pages 84-91

https://doi.org/10.1016/j.future.2006.04.010 Get rights and content

Abstract

The computing-intensive data mining for inherently Internet-wide distributed data, referred to as Distributed Data Mining (DDM), calls for the support of a powerful Grid with an effective scheduling framework. DDM often shares the computing paradigm of local processing and global synthesizing. It involves every phase of Data Mining (DM) processes, which makes the workflow of DDM very complex and can be modelled only by a Directed Acyclic Graph (DAG) with multiple data entries. Motivated by the need for a practical solution of the Grid scheduling problem for the DDM workflow, this paper proposes a novel two-phase scheduling framework, including External Scheduling and Internal Scheduling, on a two-level Grid architecture (InterGrid, IntraGrid). Currently a DM IntraGrid, named DMGCE (Data Mining Grid Computing Environment), has been developed with a dynamic scheduling framework for competitive DAGs in a heterogeneous computing environment. This system is implemented in an established Multi-Agent System (MAS) environment, in which the reuse of existing DM algorithms is achieved by encapsulating them into agents. Practical classification problems from oil well logging analysis are used to measure the system performance. The detailed experiment procedure and result analysis are also discussed in this paper.

Introduction

With the vast improvements in wide-area network performance and powerful yet low-cost computers, Grid computing has emerged as a promising attractive computing paradigm. The underlying principle of a computational Grid is the notion of providing computing power transparently in an analogy with electrical power. It aims to aggregate distributed computing resources, hide their specifications and present a homogeneous interface to end users for high performance or high throughput computation. Thus, instead of computing locally, users dispatch their tasks to the Grid and use the remote computing resources. To achieve the promising potentials of computational Grids, an effective and efficient scheduling framework within Grids is fundamentally important.

Recently, DDM has attracted lots of attention among the data mining community [1]. DDM refers to the mining of inherently distributed datasets, aiming to generate global patterns from the union set of locally distributed data. However, the security issue among different local datasets and the huge communication cost in data migration prevent moving all the datasets to a public site. Thus, the algorithms of DDM often adopt a computing paradigm of local processing and global synthesizing, which means that the mining process takes place at a local level and then at a global level where local data mining results are combined to gain global findings. Furthermore, the local processing often concerns multiple phases of data mining, including preprocessing, training and evaluation. The diversity of algorithms in each mining phase makes the DDM workflow so complex that it requires a DAG to model it.

This paper concerns the development of a scheduling framework on a two-level Grid architecture illustrated in Fig. 1 for complex DDM workflows. In the two-level Grid the low level is an IntraGrid while the high level is an InterGrid.

IntraGrid

A typical IntraGrid topology exists within a single organization. This organization could be made up of many computers, which are connected by a private high-speed local network. The primary characteristic of an IntraGrid is the bandwidth guarantee on the private network.

InterGrid

An InterGrid is an Internet-wide Grid, consisting of multiple IntraGrids connected by WAN. Due to WAN connectivity the communication speed between IntraGrids could be comparably slow.

Our approach for scheduling complex DDM DAG is performed in two phases: external scheduling and internal scheduling. They are involved with InterGrids and IntraGrids, respectively. Issues such as scalability, flexibility, and adaptability are critical for a practical wide-area deployment of Grid systems, which require an efficient and effective scheduling framework. That is the motivation of this study.

The arrangement of the rest of this paper is as follows. Section 2 describes the workflow of distributed classification and formalizes the scheduling problem. Section 3 presents the two-phase scheduling framework, including the external scheduling at the InterGrid level and the internal scheduling within an IntraGrid. Section 4 evaluates the performance of the developed DM IntraGrid by real-world datasets for classification. The related work and conclusions will be given in Section 5. The implementation issues of this DM IntraGrid in the multi-agent system environment is omitted due to space limitations.

Section snippets

Workflow of DDM: A computing paradigm of local processing and global synthesizing

DDM is the process of performing data mining in distributed computing environments, where users, data, hardware and data mining software are geographically distributed. It emerges as an area of research interest to deal with naturally distributed and heterogeneous databases and then to address the scalability bottlenecks of mining very large datasets [3]. A number of distributed algorithms have been developed for different DDM tasks, including distributed classification, clustering and

Two-phase scheduling for distributed classification workflow in an intergrid

The Grid scheduling process of the workflow of distributed classification consists of four steps: partition of distributed classification workflow, external scheduling, internal scheduling and synthesization of local patterns. Partition of distributed classification workflow divides the whole $k$ -entry DAG into $k$ sub-DAGs, each of which represents the corresponding local processing. The external scheduling involves the process of mapping the resultant sub-DAGs onto suitable IntraGrids according

Experimental procedure and results

We first focus on the implementation of the DM IntraGrid, involving internal scheduling only. This IntraGrid, named DMGCE (Data Mining Grid Computing Environment), is developed in a MAS environment MAGE [10] so as to measure the system performance and then to provide this Grid service practically. The evaluation of this system is carried out with practical DM data from well logging analysis. Well logging analysis plays an essential role in petroleum exploration and exploitation. It is used to

Related work

The issues of building a computational Grid for Data Mining have been recently addressed by a number of researchers. WEKA4WS [11] adapts the Weka toolkit to a Grid environment and exposes all the 78 algorithms as WSRF-compliant Web Services. FAEHIM (Federated Analysis Environment for Heterogeneous Intelligent Mining) [12] is Web Services based on a toolkit of DM and mainly focuses on the composition of existing DM Web Services by Triana problem solving environment [13]. The Knowledge Grid [14],

Acknowledgements

Our work is supported by the National Science Foundation of China (No. 60435010), the National 863 Project (No. 2003AA115220), the National 973 Project (No. 2003CB317004) and the Nature Science Foundation of Beijing (No. 4052025). Kevin Lü would like to show his appreciation to Wang Kuan Cheng Science Foundation, Chinese Academy of Sciences for the funding to enable him to conduct this research.

Ping Luo is a Ph.D. student in the Intelligent Science Group at the Institute of Computing Technology at the Chinese Academy of Sciences. His research interests include algorithms and computing architecture of distributed data mining, machine learning, and novel applications in data mining.

References (17)

M. Cannataro et al.
Distributed data mining on the grid
Future Generation Computer Systems
(2002)
Y. Fu, Distributed data mining: An overview. IEEE TCDP newsletter,...
Y. Zhu, A survey on grid scheduling systems, Technical Report, Computer Science Department of Hong Kong University of...
S. Krishnaswamy, S. Loke, A. Zaslavsky, Supporting the optimization of distributed data mining by predicting...
H. Chen, M. Maheswaran, Distributed dynamic scheduling of composite tasks on grid computing systems, in: Proceedings of...
Y.-K. Kwok et al.
Static scheduling algorithms for allocating directed task graphs to multiprocessors
ACM Computing Surveys
(1999)
D. Fernandez-Baca
Allocating modules to processors in a distributed system
IEEE Transaction on Software Engineering
(1989)
M. Iverson, F. Ozguner, Dynamic, competitive scheduling of multiple dags in a distributed heterogeneous environment,...

There are more references available in the full text version of this article.

Cited by (27)

A fast and resource efficient mining algorithm for discovering frequent patterns in distributed computing environments
2015, Future Generation Computer Systems
The advancement of electronic technology enables us to collect logs from various devices. Such logs require detailed analysis in order to be broadly useful. Data mining is a technique that has been widely used to extract hidden information from such data. Data mining is mainly composed of association rules mining, sequent pattern mining, classification and clustering. Association rules mining has attracted significant attention and been successfully applied to various fields. Although the past studies can effectively discover frequent patterns to deduce association rules, execution efficiency is still a critical problem. To speed up execution, many methods using parallel and distributed computing technology have been proposed in recent years. Most of the past studies focused on parallelizing the workload in a high end machine or in distributed computing environments like grid or cloud computing systems; however, very few of them discuss how to efficiently determine the appropriate number of computing nodes, considering execution efficiency and load balancing. An intuition is that execution speed is proportional to the number of computing nodes—that is, more the number of computing nodes, faster is the execution speed. However, this is incorrect for such algorithms because of the inherently algorithmic design. Allocating too many computing nodes can lead to high execution time. In addition to the execution inefficiency, inappropriate resource allocation is a waste of computing power and network bandwidth. At the same time, load cannot be effectively distributed if there are too few nodes allocated. In this paper, we propose a fast, load balancing and resource efficient algorithm named FLR-Mining for discovering frequent patterns in distributed computing systems. FLR-Mining is capable of determining the appropriate number of computing nodes automatically and achieving better load balancing as compared with existing methods. Through empirical evaluation, FLR-Mining is shown to deliver excellent performance in terms of execution efficiency and load balancing.
An empirical study on mining sequential patterns in a grid computing environment
2012, Expert Systems with Applications
Mining sequential patterns (MSP) is an important task for knowledge discovery and data mining (KDD). Like in most KDD tasks, MSP also invokes a number of iterations for generating, adjusting, and comparing data. This paper presents an empirical study on deploying MSP in a grid computing environment and demonstrates the effectiveness and performance improvements gained in this deployment. GSP, which is a typical MSP method, is used as the mining algorithm to be investigated. A grid computing environment is designed and implemented, where all GSP functions are organized as loosely coupled web-services. MSP is achieved through the cooperation of these web-services using the divide-and-conquer strategy. Several monitoring mechanisms are developed to help manage the MSP process. The experimental results show that the proposed grid computing environment provides a flexible and efficient platform for MSP.
APHID: An architecture for private, high-performance integrated data mining
2010, Future Generation Computer Systems
Citation Excerpt :
BODHI supports the DDM process through communication facilities, independent representation of data models and code mobility. The Data Mining Grid Computing Environment (DMGCE) [45] focuses on scheduling of tasks for multi-agent data mining environments. DMGCE splits up data mining workflows on a two tier level, with one stage occurring at a locally administered grid tier, and another stage occurring among independently administered grids.
While the emerging field of privacy preserving data mining (PPDM) will enable many new data mining applications, it suffers from several practical difficulties. PPDM algorithms are challenging to develop and computationally intensive to execute. Developers need convenient abstractions to simplify the engineering of PPDM applications. The individual parties involved in the data mining process need a way to bring high-performance, parallel computers to bear on the computationally intensive parts of the PPDM tasks. This paper discusses APHID (Architecture for Private and High-performance Integrated Data mining), a practical architecture and software framework for developing and executing large scale PPDM applications. At one tier, the system supports simplified use of cluster and grid resources, and at another tier, the system abstracts communication for easy PPDM algorithm development. This paper offers a detailed analysis of the challenges in developing PPDM algorithms with existing frameworks, and motivates the design of a new infrastructure based on these challenges.
A grid portal for solving geoscience problems using distributed knowledge discovery services
2010, Future Generation Computer Systems
Citation Excerpt :
Techniques for distributing this part are in progress and have not been implemented in this version of MOSE’. In the literature, there is a number of papers concerning distributed knowledge services on the grid [17–19] and a few works correlated to the use of grid technologies and workflows for coping with Geoscience applications [20]. However, to the best of our knowledge, no work merging the potentialities of the two fields is presented.
This paper describes our research effort to employ Grid technologies to enable the development of geoscience applications by integrating workflow technologies with data mining resources and a portal framework in unique work environment called MOSÈ. Using MOSÈ, a user can easily compose and execute geo-workflows for analyzing and managing natural disasters such as landslides, earthquakes, floods, wildfires, etc. MOSÈ is designed to be applicable both for the implementation of response strategies when emergencies occur and for disaster prevention. It takes advantage of the standardized resource access and workflow support for loosely coupled software components provided by web/grid services technologies. The integration of workflows with data mining services significantly improves data analysis. Geospatial data management and mining are critical areas of modern-day geosciences research. An important challenge for geospatial information mining is the distributed nature of the data. MOSÈ provides knowledge discovery services based on the WEKA data mining library and novel distributed data mining algorithms for spatial data analysis. A P2P bio-inspired algorithm for distributed spatial clustering as an example of distributed knowledge discovery service for intensive data analysis is presented. A real case application for the analysis of landslide hazard areas in the Campania Region near the Sarno area shows the advantages of using the portal.
Special section on workflow systems and applications in e-Science
2009, Future Generation Computer Systems
Towards a general model of the multi-criteria workflow scheduling on the grid
2009, Future Generation Computer Systems
Citation Excerpt :
The workflows considered in Pegasus [44] also have a regular structure; the regular structure of these workflows was the motivation for introducing in Pegasus the idea of workflow partitioning that consists in converting the workflow to a sequence of subworkflows. Dynamic Data Mining workflows addressed by Luo et al. [45,46] are also based on a specific structure, corresponding to the data mining process. Apart from task mapping, also changing the basic workflow structure can be considered as a scheduling method.
Workflow scheduling on the Grid becomes more challenging when multiple scheduling criteria are considered. Existing studies provide different approaches to the multi-criteria Grid workflow scheduling problem, and address different variants of the problem. A profound understanding of the problem’s nature can be an important step towards more generic scheduling approaches. Based on the related work and on our own experience, we propose several novel taxonomies of the problem, considering five facets: workflow model, scheduling criteria, scheduling process, resource model, and task model. We make a survey of the existing related work, and classify it according to the proposed taxonomies, identifying the most common use cases and the areas that have not been sufficiently explored yet.

View all citing articles on Scopus

Dr. Kevin Lü is a senior lecturer at Brunel University, Uxbridge, UK. He has been working on a number of projects on parallel databases, data mining and multi-agent systems. His current research interests include multi-agent systems, data management, data mining, distributed computing and parallel techniques. He has published more than 40 research papers.

Zhongzhi Shi is a Professor at the Institute of Computing Technology, the Chinese Academy of Sciences, leading the Research Group of Intelligent Science. His research interests include intelligence science, multi-agent systems, semantic Web, machine learning and neural computing. He has won a 2nd-Grade National Award at Science and Technology Progress of China in 2002, and two 2nd-Grade Awards at Science and Technology Progress of the Chinese Academy of Sciences in 1998 and 2001, respectively. He is a senior member of IEEE, member of AAAI and ACM, Chair for the WG 12.2 of IFIP. He serves as Vice President for Chinese Association of Artificial Intelligence, Executive President of Chinese Neural Network Council.

Qing He received the Ph.D. degree from Beijing Normal University in 2000. Until 1997, he worked at Hebei University of Science and Technology as an associate professor. He is currently an associate professor at the Institute of Computing and Technology, CAS. His interests include data mining, machine learning, classification, fuzzy clustering.

View full text