An on-line replication strategy to increase availability in Data Grids

https://doi.org/10.1016/j.future.2007.04.009

Abstract

Data is typically replicated in a Data Grid to improve job response time and data availability. Strategies for data replication in a Data Grid have previously been proposed, but they typically assume unlimited storage for replicas. In this paper, we address the system-wide data availability problem assuming limited replica storage. We describe two new metrics for evaluating the reliability of the system, and propose an on-line optimizer algorithm that minimizes the data missing rate (MinDmr) in order to maximize data availability. Based on MinDmr, we develop four optimizers, each associated with a different file access prediction function. Simulation results obtained with OptorSim show that, measured by the two new metrics, our MinDmr strategies achieve better overall data availability than competing strategies.

Introduction

The popularity of data-intensive scientific applications, in which millions of files are generated from scientific experiments and thousands of users world-wide access this data, has resulted in the emergence of Grid computing. In a Grid system, the resources of many computers, spanning geographic locations and organizations, are utilized to solve large-scale problems. These geographically distributed systems with loosely coupled jobs can require the management of an extremely large number of data sets. A Grid computing system for processing and managing this very large amount of distributed data is a Data Grid. Examples of Data Grids are the Biomedical Informatics Research Network (BIRN) [22], the Large Hadron Collider (LHC) [23] at the particle physics center CERN, the DataGrid project (EDG) [21] funded by the European Union, now known as the Enabling Grids for E-SciencE project (EGEE) [5], the International Virtual Observatory Alliance (IVOA) Grid community Research Group [24] and physics Data Grids [6], [12]. Data Grids require users to share both data and resources, and the management of such a large volume of data sets has posed a challenging problem in how to make the data more approachable and available to the users.

A common solution to improve availability and file access time in a Data Grid is to replicate the data. When data is replicated, copies of data files are created at many different sites in the Data Grid [3]. Deciding where and when to place these copies among the distributed nodes is a problem common to all data replication schemes for Data Grids. Earlier research on data replication [2], [4], [8], [11], [14], [16] focused on decreasing data access latency and network bandwidth consumption. As bandwidth and computing capacity have become relatively cheap, data access latency can drop dramatically, and improving data availability and system reliability has become the new focus.

The dynamic behavior of Grid users, in combination with the large volume of datasets, makes it difficult to make data replication decisions that meet the system availability goal [15]. In a Data Grid system, hundreds of clients across the globe submit job requests, each of which accesses multiple files to perform some type of analysis. In data-intensive applications, when a job accesses a massive file, the unavailability of that file can cause the whole job to hang, and the resulting delay can be unbounded. In large-scale data-intensive systems, hundreds of nodes are involved, and any node failure or network outage can cause file unavailability. As a result, there has been an increase in research on how to maximize file availability. Data replication strategies to improve data availability have been proposed in [13], [15], but they assume unlimited storage for replicas.

In this paper, we address the system-wide data availability problem assuming limited replica storage. We present two new data availability metrics, the System File Missing Rate and the System Bytes Missing Rate. (We are not aware of any other research at this time that utilizes system-wide data availability metrics.) We then model the problem in terms of an optimal solution in a static system. More importantly, for on-line processing of file replication, we propose a novel heuristic algorithm that maximizes data availability by Minimizing the Data Missing Rate (MinDmr) under limited storage resources, without sacrificing data access latency. Based on MinDmr, we present four optimizers, each associated with a different prediction function. Our test results on the popular OptorSim [9] show that our four MinDmr replica schemes perform better overall than the Binomial and Zipf economical replica schemes [2], [4], LFU, and LRU on the two metrics.

The rest of the paper is organized as follows. We describe related work in Grid systems in Section 2. Section 3 presents the two measures (the System File Missing Rate and the System Bytes Missing Rate), and discusses the system model. We present our analytical model and the dynamic replica algorithm in Section 4. In Section 5, we describe our simulation results based on the OptorSim, a simulator designed by the European Data Grid Project [5]. In the final section, we conclude our paper and describe future work.


Related work

Work on data availability in Grid systems initially focused on decreasing data access latency and network bandwidth consumption. In [14], six replica strategies (No Replica, Best Client, Cascading Replication, Plain Caching, Caching plus Cascading Replica and Fast Spread) are simulated under three user access patterns (random access, small temporal locality, and small geographical and temporal locality). The simulation results show that the best strategy has significant savings in

New metrics

While the previous measures of mean available bandwidth and percentage of computer usage are important to the overall functioning of an efficient Grid system, the Grid user is concerned with completing a job with correct data. Any data file access failure can lead to an incorrect result or a job crash. To increase the system reliability by protecting the user from such risk, the Grid system will be compelled to make the data availability as high as possible. Although some users can tolerate
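The snippet above introduces the two metrics only by name, and their formulas are not reproduced here. As an illustration, one plausible reading of the definitions (the System File Missing Rate as the fraction of requested files that are unavailable, and the System Bytes Missing Rate as the fraction of requested bytes that are unavailable) can be sketched in a few lines of Python; the function and variable names are ours, not the paper's:

```python
# Illustrative sketch of the two availability metrics (our reading, not
# the paper's exact formulas). A "request" is a (file_id, size_in_bytes)
# pair; `available(file_id)` reports whether any replica is reachable.

def missing_rates(requests, available):
    """Return (SFMR, SBMR) for a list of file requests.

    SFMR: fraction of requested files with no reachable replica.
    SBMR: fraction of requested bytes with no reachable replica.
    """
    total_files = len(requests)
    total_bytes = sum(size for _, size in requests)
    missing_files = sum(1 for fid, _ in requests if not available(fid))
    missing_bytes = sum(size for fid, size in requests if not available(fid))
    sfmr = missing_files / total_files if total_files else 0.0
    sbmr = missing_bytes / total_bytes if total_bytes else 0.0
    return sfmr, sbmr

# Example: three requests, the largest file unreachable.
reqs = [("f1", 100), ("f2", 300), ("f3", 600)]
print(missing_rates(reqs, lambda fid: fid != "f3"))
```

Note how the two metrics diverge: one missing file out of three gives an SFMR of one third, but because that file holds 600 of the 1000 requested bytes, the SBMR is 0.6, which is why a byte-weighted view matters for massive files.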

Limited storage system availability algorithms

To maximize system data availability, we could always make enough copies of each file to drive the SFMR and SBMR as low as possible by assuming infinite storage resources in the Grid (as in the replica scheme described in [13]). However, in real life we cannot make such an assumption of infinite storage. Improving the SFMR or SBMR objective, subject to limited storage resources, is the key to achieving system data availability. In addition, we study the on-line replication optimization
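The paper's MinDmr optimizer itself is not reproduced in this snippet. As a rough illustration of the greedy idea (evict the replicas whose predicted future-access value per byte is lowest, and admit an incoming replica only when it is predicted to be worth more than what it would displace), here is a hedged Python sketch; the admission rule and the `pred` value function are stand-ins for the paper's four prediction functions, not the authors' exact algorithm:

```python
# Illustrative greedy replica-replacement sketch in the spirit of MinDmr
# (our stand-in, not the paper's exact optimizer). `pred(file_id)` is a
# placeholder for a file access prediction function.

def admit_replica(cache, capacity, new_file, new_size, pred):
    """cache maps file_id -> size in bytes. Mutates cache in place.

    Returns True if the incoming replica was stored, False otherwise.
    """
    used = sum(cache.values())
    if used + new_size <= capacity:
        cache[new_file] = new_size
        return True
    # Consider evicting resident replicas with the lowest predicted
    # future-access value per byte first.
    victims = sorted(cache, key=lambda f: pred(f) / cache[f])
    freed, evict = 0, []
    for f in victims:
        if used - freed + new_size <= capacity:
            break
        evict.append(f)
        freed += cache[f]
    if used - freed + new_size > capacity:
        return False  # file larger than the whole store
    # Admit only if the newcomer's predicted value exceeds the total
    # predicted value of everything that would be evicted.
    if pred(new_file) > sum(pred(f) for f in evict):
        for f in evict:
            del cache[f]
        cache[new_file] = new_size
        return True
    return False
```

Under this admission rule, a cold new replica never displaces hot resident files, which is the intuition behind minimizing the data missing rate when storage is scarce.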

Simulator and configuration

We evaluate the performance of our MinDmr replica and replacement strategy using OptorSim v2.0.0, which was developed by the EU DataGrid Project [5] to test dynamic replica schemes. The designers of OptorSim state that it was developed to “model the interactions of the individual Grid components of a running Data Grid as realistically as possible” [1]. As a result, we chose OptorSim in order to test our proposed strategies on the most realistic testbed currently available. OptorSim
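The exact workload configuration is not reproduced in this snippet, but OptorSim-style studies (including the Zipf economical scheme cited in the introduction) commonly assume Zipf-like file popularity, where a few files are hot and most are cold. A minimal stdlib sketch of generating such a synthetic request stream; the function name and parameters are illustrative, not taken from the paper or from OptorSim:

```python
# Hypothetical workload generator: a Zipf-like request stream of the
# kind often used to drive replica-optimizer simulations. The rank-r
# file is requested with weight 1/r**alpha.
import random

def zipf_requests(num_files, num_requests, alpha=1.0, seed=42):
    rng = random.Random(seed)  # fixed seed for a repeatable experiment
    weights = [1.0 / rank ** alpha for rank in range((1), num_files + 1)]
    files = [f"file{rank}" for rank in range(1, num_files + 1)]
    return rng.choices(files, weights=weights, k=num_requests)

stream = zipf_requests(num_files=100, num_requests=10_000)
# The rank-1 file should dominate a mid-ranked file in the stream.
print(stream.count("file1") > stream.count("file50"))
```

Feeding such a skewed stream to a replacement policy is what makes prediction-based schemes distinguishable from LRU/LFU baselines in simulation.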

Conclusions and future work

This paper studies the data availability problem for a Data Grid system with storage constraints. We have presented two metrics to evaluate the availability of the system data. We then discussed how we model the system availability problem and how to transfer this to a classic optimal problem. For on-line replication that can be close to the optimal solution, with the assumption that the Grid storage space is limited, we present our MinDmr replica greedy optimizer algorithm that treats hot and

Ming Lei is a Ph.D. candidate in Computer Science at The University of Alabama. He received a B.A. from Southwest Jiaotong University, China in 2000. His research interests include real-time database systems, networking security, high-performance grid computing and mobile data management.

References (24)

  • Ming Lei, S. Vrbsky, A data replication strategy to increase availability in Data Grids, in: Grid Computing and...
  • T.E. Ng, H. Zhang, Predicting internet network distance with coordinates-based approaches, in: 21st IEEE INFOCOM...


Susan V. Vrbsky is an Associate Professor of Computer Science at The University of Alabama. Dr. Vrbsky received her Ph.D. in Computer Science from The University of Illinois, Urbana-Champaign. She received an M.S. in Computer Science from Southern Illinois University, Carbondale, IL and a B.A. from Northwestern University in Evanston, IL. Her research interests include real-time database systems, uncertainty and approximations, Data Grids and mobile data management.

Xiaoyan Hong is an Assistant Professor in the Department of Computer Science at The University of Alabama. She received her Ph.D. degree in Computer Science from The University of California at Los Angeles (UCLA) in 2003. Dr. Hong’s research is in the area of computer networks, covering mobile and wireless networks, vehicle networks, wireless sensor networks and grid computing networks. Her current research focuses on scalable routing, mobility modeling, privacy and security of wireless networks and resource allocation optimizations.

Earlier versions of this paper were presented at the Third International Workshop on Networks for Grid Applications, GridNets 2006, held in San José, California, USA, October 1–2, 2006.
