Distributed data mining in grid computing environments
Introduction
With the vast improvements in wide-area network performance and powerful yet low-cost computers, Grid computing has emerged as a promising attractive computing paradigm. The underlying principle of a computational Grid is the notion of providing computing power transparently in an analogy with electrical power. It aims to aggregate distributed computing resources, hide their specifications and present a homogeneous interface to end users for high performance or high throughput computation. Thus, instead of computing locally, users dispatch their tasks to the Grid and use the remote computing resources. To achieve the promising potentials of computational Grids, an effective and efficient scheduling framework within Grids is fundamentally important.
Recently, DDM has attracted lots of attention among the data mining community [1]. DDM refers to the mining of inherently distributed datasets, aiming to generate global patterns from the union set of locally distributed data. However, the security issue among different local datasets and the huge communication cost in data migration prevent moving all the datasets to a public site. Thus, the algorithms of DDM often adopt a computing paradigm of local processing and global synthesizing, which means that the mining process takes place at a local level and then at a global level where local data mining results are combined to gain global findings. Furthermore, the local processing often concerns multiple phases of data mining, including preprocessing, training and evaluation. The diversity of algorithms in each mining phase makes the DDM workflow so complex that it requires a DAG to model it.
This paper concerns the development of a scheduling framework on a two-level Grid architecture illustrated in Fig. 1 for complex DDM workflows. In the two-level Grid the low level is an IntraGrid while the high level is an InterGrid.
- IntraGrid
A typical IntraGrid topology exists within a single organization. This organization could be made up of many computers, which are connected by a private high-speed local network. The primary characteristic of an IntraGrid is the bandwidth guarantee on the private network.
- InterGrid
An InterGrid is an Internet-wide Grid, consisting of multiple IntraGrids connected by WAN. Due to WAN connectivity the communication speed between IntraGrids could be comparably slow.
Our approach for scheduling complex DDM DAG is performed in two phases: external scheduling and internal scheduling. They are involved with InterGrids and IntraGrids, respectively. Issues such as scalability, flexibility, and adaptability are critical for a practical wide-area deployment of Grid systems, which require an efficient and effective scheduling framework. That is the motivation of this study.
The arrangement of the rest of this paper is as follows. Section 2 describes the workflow of distributed classification and formalizes the scheduling problem. Section 3 presents the two-phase scheduling framework, including the external scheduling at the InterGrid level and the internal scheduling within an IntraGrid. Section 4 evaluates the performance of the developed DM IntraGrid by real-world datasets for classification. The related work and conclusions will be given in Section 5. The implementation issues of this DM IntraGrid in the multi-agent system environment is omitted due to space limitations.
Section snippets
Workflow of DDM: A computing paradigm of local processing and global synthesizing
DDM is the process of performing data mining in distributed computing environments, where users, data, hardware and data mining software are geographically distributed. It emerges as an area of research interest to deal with naturally distributed and heterogeneous databases and then to address the scalability bottlenecks of mining very large datasets [3]. A number of distributed algorithms have been developed for different DDM tasks, including distributed classification, clustering and
Two-phase scheduling for distributed classification workflow in an intergrid
The Grid scheduling process of the workflow of distributed classification consists of four steps: partition of distributed classification workflow, external scheduling, internal scheduling and synthesization of local patterns. Partition of distributed classification workflow divides the whole -entry DAG into sub-DAGs, each of which represents the corresponding local processing. The external scheduling involves the process of mapping the resultant sub-DAGs onto suitable IntraGrids according
Experimental procedure and results
We first focus on the implementation of the DM IntraGrid, involving internal scheduling only. This IntraGrid, named DMGCE (Data Mining Grid Computing Environment), is developed in a MAS environment MAGE [10] so as to measure the system performance and then to provide this Grid service practically. The evaluation of this system is carried out with practical DM data from well logging analysis. Well logging analysis plays an essential role in petroleum exploration and exploitation. It is used to
Related work
The issues of building a computational Grid for Data Mining have been recently addressed by a number of researchers. WEKA4WS [11] adapts the Weka toolkit to a Grid environment and exposes all the 78 algorithms as WSRF-compliant Web Services. FAEHIM (Federated Analysis Environment for Heterogeneous Intelligent Mining) [12] is Web Services based on a toolkit of DM and mainly focuses on the composition of existing DM Web Services by Triana problem solving environment [13]. The Knowledge Grid [14],
Acknowledgements
Our work is supported by the National Science Foundation of China (No. 60435010), the National 863 Project (No. 2003AA115220), the National 973 Project (No. 2003CB317004) and the Nature Science Foundation of Beijing (No. 4052025). Kevin Lü would like to show his appreciation to Wang Kuan Cheng Science Foundation, Chinese Academy of Sciences for the funding to enable him to conduct this research.
Ping Luo is a Ph.D. student in the Intelligent Science Group at the Institute of Computing Technology at the Chinese Academy of Sciences. His research interests include algorithms and computing architecture of distributed data mining, machine learning, and novel applications in data mining.
References (17)
- et al.
Distributed data mining on the grid
Future Generation Computer Systems
(2002) - Y. Fu, Distributed data mining: An overview. IEEE TCDP newsletter,...
- Y. Zhu, A survey on grid scheduling systems, Technical Report, Computer Science Department of Hong Kong University of...
- S. Krishnaswamy, S. Loke, A. Zaslavsky, Supporting the optimization of distributed data mining by predicting...
- H. Chen, M. Maheswaran, Distributed dynamic scheduling of composite tasks on grid computing systems, in: Proceedings of...
- et al.
Static scheduling algorithms for allocating directed task graphs to multiprocessors
ACM Computing Surveys
(1999) Allocating modules to processors in a distributed system
IEEE Transaction on Software Engineering
(1989)- M. Iverson, F. Ozguner, Dynamic, competitive scheduling of multiple dags in a distributed heterogeneous environment,...
Cited by (27)
A fast and resource efficient mining algorithm for discovering frequent patterns in distributed computing environments
2015, Future Generation Computer SystemsAn empirical study on mining sequential patterns in a grid computing environment
2012, Expert Systems with ApplicationsAPHID: An architecture for private, high-performance integrated data mining
2010, Future Generation Computer SystemsCitation Excerpt :BODHI supports the DDM process through communication facilities, independent representation of data models and code mobility. The Data Mining Grid Computing Environment (DMGCE) [45] focuses on scheduling of tasks for multi-agent data mining environments. DMGCE splits up data mining workflows on a two tier level, with one stage occurring at a locally administered grid tier, and another stage occurring among independently administered grids.
A grid portal for solving geoscience problems using distributed knowledge discovery services
2010, Future Generation Computer SystemsCitation Excerpt :Techniques for distributing this part are in progress and have not been implemented in this version of MOSE’. In the literature, there is a number of papers concerning distributed knowledge services on the grid [17–19] and a few works correlated to the use of grid technologies and workflows for coping with Geoscience applications [20]. However, to the best of our knowledge, no work merging the potentialities of the two fields is presented.
Special section on workflow systems and applications in e-Science
2009, Future Generation Computer SystemsTowards a general model of the multi-criteria workflow scheduling on the grid
2009, Future Generation Computer SystemsCitation Excerpt :The workflows considered in Pegasus [44] also have a regular structure; the regular structure of these workflows was the motivation for introducing in Pegasus the idea of workflow partitioning that consists in converting the workflow to a sequence of subworkflows. Dynamic Data Mining workflows addressed by Luo et al. [45,46] are also based on a specific structure, corresponding to the data mining process. Apart from task mapping, also changing the basic workflow structure can be considered as a scheduling method.
Ping Luo is a Ph.D. student in the Intelligent Science Group at the Institute of Computing Technology at the Chinese Academy of Sciences. His research interests include algorithms and computing architecture of distributed data mining, machine learning, and novel applications in data mining.
Dr. Kevin Lü is a senior lecturer at Brunel University, Uxbridge, UK. He has been working on a number of projects on parallel databases, data mining and multi-agent systems. His current research interests include multi-agent systems, data management, data mining, distributed computing and parallel techniques. He has published more than 40 research papers.
Zhongzhi Shi is a Professor at the Institute of Computing Technology, the Chinese Academy of Sciences, leading the Research Group of Intelligent Science. His research interests include intelligence science, multi-agent systems, semantic Web, machine learning and neural computing. He has won a 2nd-Grade National Award at Science and Technology Progress of China in 2002, and two 2nd-Grade Awards at Science and Technology Progress of the Chinese Academy of Sciences in 1998 and 2001, respectively. He is a senior member of IEEE, member of AAAI and ACM, Chair for the WG 12.2 of IFIP. He serves as Vice President for Chinese Association of Artificial Intelligence, Executive President of Chinese Neural Network Council.
Qing He received the Ph.D. degree from Beijing Normal University in 2000. Until 1997, he worked at Hebei University of Science and Technology as an associate professor. He is currently an associate professor at the Institute of Computing and Technology, CAS. His interests include data mining, machine learning, classification, fuzzy clustering.