An on-line replication strategy to increase availability in Data Grids☆
Introduction
The popularity of data-intensive scientific applications, in which millions of files are generated from scientific experiments and thousands of users world-wide access this data, has resulted in the emergence of Grid computing. In a Grid system, the resources of many computers, spanning geographic locations and organizations, are utilized to solve large-scale problems. These geographically distributed systems with loosely coupled jobs can require the management of an extremely large number of data sets. A Grid computing system for processing and managing this very large amount of distributed data is a Data Grid. Examples of Data Grids are the Biomedical Informatics Research Network (BIRN) [22], the Large Hadron Collider (LHC) [23] at the particle physics center CERN, the DataGrid project (EDG) [21] funded by the European Union, now known as the Enabling Grids for E-SciencE project (EGEE) [5], the International Virtual Observatory Alliance (IVOA) Grid community Research Group [24] and physics Data Grids [6], [12]. Data Grids require users to share both data and resources, and the management of such a large volume of data sets poses the challenge of making the data more accessible and available to users.
A common solution to improve availability and file access time in a Data Grid is to replicate the data. When data is replicated, copies of data files are created at many different sites in the Data Grid [3]. Deciding where and when to create these copies among the distributed nodes is a problem common to all data replication schemes for Data Grids. Earlier research on data replication [2], [4], [8], [11], [14], [16] focused on decreasing the data access latency and the network bandwidth consumption. As bandwidth and computing capacity have become relatively cheap, data access latency can drop dramatically, and improving data availability and system reliability has become the new focus.
The dynamic behavior of a Grid user, in combination with the large volume of datasets, makes it difficult to make data replication decisions that meet the system availability goal [15]. In a Data Grid system, hundreds of clients across the globe submit job requests, each of which accesses multiple files to perform some type of analysis. In data-intensive applications, when a job accesses a massive file, the unavailability of that file can cause the whole job to hang, and the resulting delay can be unbounded. In large-scale data-intensive systems, hundreds of nodes are involved, and any node failure or network outage can cause file unavailability. As a result, there has been an increase in research focusing on how to maximize file availability. Data replication strategies to improve data availability have been proposed in [13], [15], but they assume unlimited storage for replicas.
In this paper, we address the system-wide data availability problem assuming limited replica storage. We present two new data availability metrics, the System File Missing Rate and the System Bytes Missing Rate. (We are not aware of any other research at this time that utilizes system-wide data availability metrics.) We then model the problem in terms of an optimal solution in a static system. More importantly, for on-line processing of file replication, we propose a novel heuristic algorithm that maximizes data availability by Minimizing the Data Missing Rate (MinDmr) under limited storage resources without sacrificing data access latency. Based on MinDmr, we present four optimizers, each associated with a different prediction function. Our test results on the popular OptorSim [9] show that our four MinDmr replica schemes perform better overall than the Binomial economical replica scheme, the Zipf economical replica scheme [2], [4], LFU and LRU on both metrics.
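To make the two metrics concrete, the following is an illustrative sketch (not the authors' code): we assume each file has a request count, a size in bytes, and a per-request miss probability (the chance that no replica of the file is reachable), and compute the fraction of requests (SFMR) and of requested bytes (SBMR) expected to be missed. The dictionary layout and function names are our own assumptions.

```python
# Illustrative sketch of the two availability metrics, under the
# assumption that each file i has a request count n_i, a size s_i in
# bytes, and a per-request miss probability p_i (probability that
# every replica of file i is simultaneously unavailable).

def sfmr(requests, miss_prob):
    """System File Missing Rate: expected number of missed file
    requests divided by the total number of file requests."""
    total = sum(requests.values())
    missed = sum(requests[f] * miss_prob[f] for f in requests)
    return missed / total

def sbmr(requests, sizes, miss_prob):
    """System Bytes Missing Rate: expected number of missed bytes
    divided by the total number of requested bytes."""
    total = sum(requests[f] * sizes[f] for f in requests)
    missed = sum(requests[f] * sizes[f] * miss_prob[f] for f in requests)
    return missed / total

requests = {"f1": 10, "f2": 5}        # access counts per file
sizes = {"f1": 2_000, "f2": 8_000}    # file sizes in bytes
miss = {"f1": 0.01, "f2": 0.20}       # per-request miss probability

print(round(sfmr(requests, miss), 4))         # 0.0733
print(round(sbmr(requests, sizes, miss), 4))  # 0.1367
```

Note that the two metrics can diverge sharply: here the rarely available file f2 is large, so the byte-oriented SBMR is nearly twice the file-oriented SFMR, which is why the paper tracks both.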
The rest of the paper is organized as follows. We describe related work in Grid systems in Section 2. Section 3 presents the two measures (the System File Missing Rate and the System Bytes Missing Rate), and discusses the system model. We present our analytical model and the dynamic replica algorithm in Section 4. In Section 5, we describe our simulation results based on the OptorSim, a simulator designed by the European Data Grid Project [5]. In the final section, we conclude our paper and describe future work.
Section snippets
Related work
Work on data availability in Grid systems initially focused on decreasing the data access latency and the network bandwidth consumption. In [14], six replica strategies (No Replica, Best Client, Cascading Replication, Plain Caching, Caching plus Cascading Replica and Fast Spread) are simulated under three user access patterns (random access, small temporal locality, and small geographical and temporal locality). The simulation results show that the best strategy has significant savings in
New metrics
While the previous measures of mean available bandwidth and percentage of computer usage are important to the overall functioning of an efficient Grid system, the Grid user is concerned with completing a job with correct data. Any data file access failure can lead to an incorrect result or a job crash. To increase system reliability by protecting the user from such risk, the Grid system must make data availability as high as possible. Although some users can tolerate
Limited storage system availability algorithms
To maximize system data availability, we could always make enough copies of each file to drive the SFMR and SBMR as low as possible if we assumed infinite storage resources in the Grid (as in the replica scheme described in [13]). However, in practice we cannot make such an assumption. Optimizing the SFMR or SBMR objective subject to limited storage resources is the key to achieving system data availability. In addition, we study the on-line replication optimization
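The storage-constrained formulation above resembles a knapsack problem, so one plausible greedy sketch (illustrative only; the paper's MinDmr optimizer differs in its prediction functions and on-line operation) is to rank candidate replicas by miss-rate reduction per byte of storage and place them until the storage budget is exhausted. All names and the candidate tuple layout here are our own assumptions.

```python
# Hedged sketch of a greedy, knapsack-style replica placement under a
# storage budget. Each candidate replica is a tuple of
# (file_id, size_bytes, expected miss-rate reduction if replicated).

def greedy_replicate(candidates, capacity):
    """Pick replicas within `capacity` bytes, greedily favoring the
    largest miss-rate reduction per byte (benefit density)."""
    ranked = sorted(candidates, key=lambda c: c[2] / c[1], reverse=True)
    chosen, used = [], 0
    for file_id, size, gain in ranked:
        if used + size <= capacity:
            chosen.append(file_id)
            used += size
    return chosen

candidates = [
    ("f1", 4_000, 0.40),  # large reduction, but a large file
    ("f2", 1_000, 0.15),  # best density: 0.15 per 1,000 bytes
    ("f3", 2_000, 0.10),
]
print(greedy_replicate(candidates, 5_000))  # ['f2', 'f1']
```

The density ordering is what makes the heuristic storage-aware: f2 is chosen before f1 despite its smaller absolute benefit, and f3 is dropped once the 5,000-byte budget is spent.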
Simulator and configuration
We evaluate the performance of our MinDmr replica and replacement strategy using OptorSim v2.0.0, which was developed by the EU DataGrid Project [5] to test dynamic replica schemes. The designers of OptorSim state that it was developed to “model the interactions of the individual Grid components of a running Data Grid as realistically as possible” [1]. As a result, we chose OptorSim in order to test our proposed strategies on the most realistic testbed currently available. OptorSim
Conclusions and future work
This paper studies the data availability problem for a Data Grid system with storage constraints. We have presented two metrics to evaluate the availability of the system data. We then discussed how we model the system availability problem and how to transform it into a classic optimization problem. For on-line replication that can be close to the optimal solution, under the assumption that the Grid storage space is limited, we present our MinDmr greedy replica optimizer algorithm that treats hot and
References (24)
- et al., File-based replica management, Future Generation Computer Systems Journal (2005)
- et al., The impact of data replication on job scheduling performance in the Data Grid, Future Generation Computer Systems Journal (2006)
- et al., Complete and fragmented replica selection and retrieval in Data Grids, Future Generation Computer Systems Journal (2007)
- et al., Dynamic replication algorithms for the multi-tier Data Grid, Future Generation Computer Systems Journal (2005)
- et al., OptorSim—A grid simulator for studying dynamic data replication strategies, International Journal of High Performance Computing Applications (2003)
- et al., Evaluation of an economy-based file replication strategy for a Data Grid
- et al., Evaluating scheduling and replica optimisation strategies in OptorSim
- et al., Towards an economy-based optimisation of file access and replication on a Data Grid
- EU Data Grid project.
- GriPhyN: The Grid physics network project.
Ming Lei is a Ph.D. candidate in Computer Science at The University of Alabama. He received a B.A. from Southwest Jiaotong University, China in 2000. His research interests include real-time database systems, networking security, high-performance grid computing and mobile data management.
Susan V. Vrbsky is an Associate Professor of Computer Science at The University of Alabama. Dr. Vrbsky received her Ph.D. in Computer Science from The University of Illinois, Urbana-Champaign. She received an M.S. in Computer Science from Southern Illinois University, Carbondale, IL and a B.A. from Northwestern University in Evanston, IL. Her research interests include real-time database systems, uncertainty and approximations, Data Grids and mobile data management.
Xiaoyan Hong is an Assistant Professor in the Department of Computer Science at The University of Alabama. She received her Ph.D. degree in Computer Science from The University of California at Los Angeles (UCLA) in 2003. Dr. Hong’s research is in the area of computer networks, covering mobile and wireless networks, vehicle networks, wireless sensor networks and grid computing networks. Her current research focuses on scalable routing, mobility modeling, privacy and security of wireless networks and resource allocation optimizations.
☆ Earlier versions of this paper were presented at the Third International Workshop on Networks for Grid Applications (GridNets 2006), held in San José, California, USA, October 1–2, 2006.