Distributed data mining in grid computing environments

https://doi.org/10.1016/j.future.2006.04.010Get rights and content

Abstract

The computing-intensive data mining for inherently Internet-wide distributed data, referred to as Distributed Data Mining (DDM), calls for the support of a powerful Grid with an effective scheduling framework. DDM often shares the computing paradigm of local processing and global synthesizing. It involves every phase of Data Mining (DM) processes, which makes the workflow of DDM very complex and can be modelled only by a Directed Acyclic Graph (DAG) with multiple data entries. Motivated by the need for a practical solution of the Grid scheduling problem for the DDM workflow, this paper proposes a novel two-phase scheduling framework, including External Scheduling and Internal Scheduling, on a two-level Grid architecture (InterGrid, IntraGrid). Currently a DM IntraGrid, named DMGCE (Data Mining Grid Computing Environment), has been developed with a dynamic scheduling framework for competitive DAGs in a heterogeneous computing environment. This system is implemented in an established Multi-Agent System (MAS) environment, in which the reuse of existing DM algorithms is achieved by encapsulating them into agents. Practical classification problems from oil well logging analysis are used to measure the system performance. The detailed experiment procedure and result analysis are also discussed in this paper.

Introduction

With the vast improvements in wide-area network performance and powerful yet low-cost computers, Grid computing has emerged as a promising attractive computing paradigm. The underlying principle of a computational Grid is the notion of providing computing power transparently in an analogy with electrical power. It aims to aggregate distributed computing resources, hide their specifications and present a homogeneous interface to end users for high performance or high throughput computation. Thus, instead of computing locally, users dispatch their tasks to the Grid and use the remote computing resources. To achieve the promising potentials of computational Grids, an effective and efficient scheduling framework within Grids is fundamentally important.

Recently, DDM has attracted lots of attention among the data mining community [1]. DDM refers to the mining of inherently distributed datasets, aiming to generate global patterns from the union set of locally distributed data. However, the security issue among different local datasets and the huge communication cost in data migration prevent moving all the datasets to a public site. Thus, the algorithms of DDM often adopt a computing paradigm of local processing and global synthesizing, which means that the mining process takes place at a local level and then at a global level where local data mining results are combined to gain global findings. Furthermore, the local processing often concerns multiple phases of data mining, including preprocessing, training and evaluation. The diversity of algorithms in each mining phase makes the DDM workflow so complex that it requires a DAG to model it.

This paper concerns the development of a scheduling framework on a two-level Grid architecture illustrated in Fig. 1 for complex DDM workflows. In the two-level Grid the low level is an IntraGrid while the high level is an InterGrid.

    IntraGrid

    A typical IntraGrid topology exists within a single organization. This organization could be made up of many computers, which are connected by a private high-speed local network. The primary characteristic of an IntraGrid is the bandwidth guarantee on the private network.

    InterGrid

    An InterGrid is an Internet-wide Grid, consisting of multiple IntraGrids connected by WAN. Due to WAN connectivity the communication speed between IntraGrids could be comparably slow.

Our approach for scheduling complex DDM DAG is performed in two phases: external scheduling and internal scheduling. They are involved with InterGrids and IntraGrids, respectively. Issues such as scalability, flexibility, and adaptability are critical for a practical wide-area deployment of Grid systems, which require an efficient and effective scheduling framework. That is the motivation of this study.

The arrangement of the rest of this paper is as follows. Section 2 describes the workflow of distributed classification and formalizes the scheduling problem. Section 3 presents the two-phase scheduling framework, including the external scheduling at the InterGrid level and the internal scheduling within an IntraGrid. Section 4 evaluates the performance of the developed DM IntraGrid by real-world datasets for classification. The related work and conclusions will be given in Section 5. The implementation issues of this DM IntraGrid in the multi-agent system environment is omitted due to space limitations.

Section snippets

Workflow of DDM: A computing paradigm of local processing and global synthesizing

DDM is the process of performing data mining in distributed computing environments, where users, data, hardware and data mining software are geographically distributed. It emerges as an area of research interest to deal with naturally distributed and heterogeneous databases and then to address the scalability bottlenecks of mining very large datasets [3]. A number of distributed algorithms have been developed for different DDM tasks, including distributed classification, clustering and

Two-phase scheduling for distributed classification workflow in an intergrid

The Grid scheduling process of the workflow of distributed classification consists of four steps: partition of distributed classification workflow, external scheduling, internal scheduling and synthesization of local patterns. Partition of distributed classification workflow divides the whole k-entry DAG into k sub-DAGs, each of which represents the corresponding local processing. The external scheduling involves the process of mapping the resultant sub-DAGs onto suitable IntraGrids according

Experimental procedure and results

We first focus on the implementation of the DM IntraGrid, involving internal scheduling only. This IntraGrid, named DMGCE (Data Mining Grid Computing Environment), is developed in a MAS environment MAGE [10] so as to measure the system performance and then to provide this Grid service practically. The evaluation of this system is carried out with practical DM data from well logging analysis. Well logging analysis plays an essential role in petroleum exploration and exploitation. It is used to

Related work

The issues of building a computational Grid for Data Mining have been recently addressed by a number of researchers. WEKA4WS [11] adapts the Weka toolkit to a Grid environment and exposes all the 78 algorithms as WSRF-compliant Web Services. FAEHIM (Federated Analysis Environment for Heterogeneous Intelligent Mining) [12] is Web Services based on a toolkit of DM and mainly focuses on the composition of existing DM Web Services by Triana problem solving environment [13]. The Knowledge Grid [14],

Acknowledgements

Our work is supported by the National Science Foundation of China (No. 60435010), the National 863 Project (No. 2003AA115220), the National 973 Project (No. 2003CB317004) and the Nature Science Foundation of Beijing (No. 4052025). Kevin Lü would like to show his appreciation to Wang Kuan Cheng Science Foundation, Chinese Academy of Sciences for the funding to enable him to conduct this research.

Ping Luo is a Ph.D. student in the Intelligent Science Group at the Institute of Computing Technology at the Chinese Academy of Sciences. His research interests include algorithms and computing architecture of distributed data mining, machine learning, and novel applications in data mining.

References (17)

  • M. Cannataro et al.

    Distributed data mining on the grid

    Future Generation Computer Systems

    (2002)
  • Y. Fu, Distributed data mining: An overview. IEEE TCDP newsletter,...
  • Y. Zhu, A survey on grid scheduling systems, Technical Report, Computer Science Department of Hong Kong University of...
  • S. Krishnaswamy, S. Loke, A. Zaslavsky, Supporting the optimization of distributed data mining by predicting...
  • H. Chen, M. Maheswaran, Distributed dynamic scheduling of composite tasks on grid computing systems, in: Proceedings of...
  • Y.-K. Kwok et al.

    Static scheduling algorithms for allocating directed task graphs to multiprocessors

    ACM Computing Surveys

    (1999)
  • D. Fernandez-Baca

    Allocating modules to processors in a distributed system

    IEEE Transaction on Software Engineering

    (1989)
  • M. Iverson, F. Ozguner, Dynamic, competitive scheduling of multiple dags in a distributed heterogeneous environment,...
There are more references available in the full text version of this article.

Cited by (27)

  • APHID: An architecture for private, high-performance integrated data mining

    2010, Future Generation Computer Systems
    Citation Excerpt :

    BODHI supports the DDM process through communication facilities, independent representation of data models and code mobility. The Data Mining Grid Computing Environment (DMGCE) [45] focuses on scheduling of tasks for multi-agent data mining environments. DMGCE splits up data mining workflows on a two tier level, with one stage occurring at a locally administered grid tier, and another stage occurring among independently administered grids.

  • A grid portal for solving geoscience problems using distributed knowledge discovery services

    2010, Future Generation Computer Systems
    Citation Excerpt :

    Techniques for distributing this part are in progress and have not been implemented in this version of MOSE’. In the literature, there is a number of papers concerning distributed knowledge services on the grid [17–19] and a few works correlated to the use of grid technologies and workflows for coping with Geoscience applications [20]. However, to the best of our knowledge, no work merging the potentialities of the two fields is presented.

  • Towards a general model of the multi-criteria workflow scheduling on the grid

    2009, Future Generation Computer Systems
    Citation Excerpt :

    The workflows considered in Pegasus [44] also have a regular structure; the regular structure of these workflows was the motivation for introducing in Pegasus the idea of workflow partitioning that consists in converting the workflow to a sequence of subworkflows. Dynamic Data Mining workflows addressed by Luo et al. [45,46] are also based on a specific structure, corresponding to the data mining process. Apart from task mapping, also changing the basic workflow structure can be considered as a scheduling method.

View all citing articles on Scopus

Ping Luo is a Ph.D. student in the Intelligent Science Group at the Institute of Computing Technology at the Chinese Academy of Sciences. His research interests include algorithms and computing architecture of distributed data mining, machine learning, and novel applications in data mining.

Dr. Kevin Lü is a senior lecturer at Brunel University, Uxbridge, UK. He has been working on a number of projects on parallel databases, data mining and multi-agent systems. His current research interests include multi-agent systems, data management, data mining, distributed computing and parallel techniques. He has published more than 40 research papers.

Zhongzhi Shi is a Professor at the Institute of Computing Technology, the Chinese Academy of Sciences, leading the Research Group of Intelligent Science. His research interests include intelligence science, multi-agent systems, semantic Web, machine learning and neural computing. He has won a 2nd-Grade National Award at Science and Technology Progress of China in 2002, and two 2nd-Grade Awards at Science and Technology Progress of the Chinese Academy of Sciences in 1998 and 2001, respectively. He is a senior member of IEEE, member of AAAI and ACM, Chair for the WG 12.2 of IFIP. He serves as Vice President for Chinese Association of Artificial Intelligence, Executive President of Chinese Neural Network Council.

Qing He received the Ph.D. degree from Beijing Normal University in 2000. Until 1997, he worked at Hebei University of Science and Technology as an associate professor. He is currently an associate professor at the Institute of Computing and Technology, CAS. His interests include data mining, machine learning, classification, fuzzy clustering.

View full text