An on-line replication strategy to increase availability in Data Grids

https://doi.org/10.1016/j.future.2007.04.009

Abstract

Data is typically replicated in a Data Grid to improve job response time and data availability. Strategies for data replication in a Data Grid have previously been proposed, but they typically assume unlimited storage for replicas. In this paper, we address the system-wide data availability problem assuming limited replica storage. We describe two new metrics for evaluating the reliability of the system, and propose an on-line optimizer algorithm that minimizes the data missing rate (MinDmr) in order to maximize data availability. Based on MinDmr, we develop four optimizers, each associated with a different file access prediction function. Simulation results obtained with OptorSim show that, measured by the two new metrics, our MinDmr strategies achieve better overall data availability than competing strategies.

Introduction

The popularity of data-intensive scientific applications, in which millions of files are generated from scientific experiments and thousands of users world-wide access this data, has resulted in the emergence of Grid computing. In a Grid system, the resources of many computers, spanning geographic locations and organizations, are utilized to solve large-scale problems. These geographically distributed systems with loosely coupled jobs can require the management of an extremely large number of data sets. A Grid computing system for processing and managing this very large amount of distributed data is a Data Grid. Examples of Data Grids are the Biomedical Informatics Research Network (BIRN) [22], the Large Hadron Collider (LHC) [23] at the particle physics center CERN, the DataGrid project (EDG) [21] funded by the European Union, now known as the Enabling Grids for E-SciencE project (EGEE) [5], the International Virtual Observatory Alliance (IVOA) Grid community Research Group [24] and physics Data Grids [6], [12]. Data Grids require users to share both data and resources, and the management of such a large volume of data sets has posed a challenging problem in how to make the data more approachable and available to the users.

A common solution to improve availability and file access time in a Data Grid is to replicate the data. When data is replicated, copies of data files are created at many different sites in the Data Grid [3]. Deciding where and when to place these copies among the distributed nodes is a problem common to all data replication schemes for Data Grids. Earlier research on data replication [2], [4], [8], [11], [14], [16] focused on decreasing data access latency and network bandwidth consumption. As bandwidth and computing capacity have become relatively cheap, data access latency can drop dramatically, and improving data availability and system reliability has become the new focus.

The dynamic behavior of Grid users, in combination with the large volume of datasets, makes it difficult to make data replication decisions that meet the system availability goal [15]. In a Data Grid system, hundreds of clients across the globe submit job requests, each of which accesses multiple files to perform some type of analysis. In data-intensive applications, when a job accesses a massive file, the unavailability of that file can cause the whole job to hang, and the resulting delay can be unbounded. In large-scale data-intensive systems, hundreds of nodes are involved, and any node failure or network outage can cause file unavailability. As a result, there has been an increase in research on how to maximize file availability. Data replication strategies to improve data availability have been proposed in [13], [15], but they assume unlimited storage for replicas.

In this paper, we address the system-wide data availability problem assuming limited replica storage. We present two new data availability metrics, the System File Missing Rate and the System Bytes Missing Rate. (We are not aware of any other research at this time that utilizes system-wide data availability metrics.) We then model the problem in terms of an optimal solution in a static system. More importantly, for on-line processing of file replication, we propose a novel heuristic algorithm that maximizes data availability by Minimizing the Data Missing Rate (MinDmr) under limited storage resources, without sacrificing data access latency. Based on MinDmr, we present four optimizers, each associated with a different prediction function. Our test results on the popular OptorSim [9] show that our four MinDmr replica schemes perform better overall than the Binomial and Zipf economical replica schemes [2], [4], LFU, and LRU on the two metrics.

The rest of the paper is organized as follows. We describe related work in Grid systems in Section 2. Section 3 presents the two measures (the System File Missing Rate and the System Bytes Missing Rate), and discusses the system model. We present our analytical model and the dynamic replica algorithm in Section 4. In Section 5, we describe our simulation results based on the OptorSim, a simulator designed by the European Data Grid Project [5]. In the final section, we conclude our paper and describe future work.


Related work

Work on data availability in Grid systems initially focused on decreasing data access latency and network bandwidth consumption. In [14], six replica strategies (No Replica, Best Client, Cascading Replication, Plain Caching, Caching plus Cascading Replica and Fast Spread) are simulated under three user access patterns (random access, small temporal locality, and small geographical and temporal locality). The simulation results show that the best strategy has significant savings in

New metrics

While the previous measures of mean available bandwidth and percentage of computer usage are important to the overall functioning of an efficient Grid system, the Grid user is concerned with completing a job with correct data. Any data file access failure can lead to an incorrect result or a job crash. To increase the system reliability by protecting the user from such risk, the Grid system will be compelled to make the data availability as high as possible. Although some users can tolerate
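The snippet above introduces the two metrics only by name, and their formulas are not reproduced here. As an illustration, one plausible reading of the definitions (the System File Missing Rate as the fraction of requested files that are unavailable, and the System Bytes Missing Rate as the fraction of requested bytes that are unavailable) can be sketched in a few lines of Python; the function and variable names are ours, not the paper's:

```python
# Illustrative sketch of the two availability metrics (our reading, not
# the paper's exact formulas). A "request" is a (file_id, size_in_bytes)
# pair; `available(file_id)` reports whether any replica is reachable.

def missing_rates(requests, available):
    """Return (SFMR, SBMR) for a list of file requests.

    SFMR: fraction of requested files with no reachable replica.
    SBMR: fraction of requested bytes with no reachable replica.
    """
    total_files = len(requests)
    total_bytes = sum(size for _, size in requests)
    missing_files = sum(1 for fid, _ in requests if not available(fid))
    missing_bytes = sum(size for fid, size in requests if not available(fid))
    sfmr = missing_files / total_files if total_files else 0.0
    sbmr = missing_bytes / total_bytes if total_bytes else 0.0
    return sfmr, sbmr

# Example: three requests, the largest file unreachable.
reqs = [("f1", 100), ("f2", 300), ("f3", 600)]
print(missing_rates(reqs, lambda fid: fid != "f3"))
```

Note how the two metrics diverge: one missing file out of three gives an SFMR of one third, but because that file holds 600 of the 1000 requested bytes, the SBMR is 0.6, which is why a byte-weighted view matters for massive files.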

Limited storage system availability algorithms

To maximize system data availability, we could always make enough copies of each file to drive the SFMR and SBMR as low as possible by assuming infinite storage resources in the Grid (as in the replica scheme described in [13]). However, in real life we cannot make such an assumption of infinite storage. Improving the SFMR or SBMR objective, subject to limited storage resources, is the key to achieving system data availability. In addition, we study the on-line replication optimization
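The paper's MinDmr optimizer itself is not reproduced in this snippet. As a rough illustration of the greedy idea (evict the replicas whose predicted future-access value per byte is lowest, and admit an incoming replica only when it is predicted to be worth more than what it would displace), here is a hedged Python sketch; the admission rule and the `pred` value function are stand-ins for the paper's four prediction functions, not the authors' exact algorithm:

```python
# Illustrative greedy replica-replacement sketch in the spirit of MinDmr
# (our stand-in, not the paper's exact optimizer). `pred(file_id)` is a
# placeholder for a file access prediction function.

def admit_replica(cache, capacity, new_file, new_size, pred):
    """cache maps file_id -> size in bytes. Mutates cache in place.

    Returns True if the incoming replica was stored, False otherwise.
    """
    used = sum(cache.values())
    if used + new_size <= capacity:
        cache[new_file] = new_size
        return True
    # Consider evicting resident replicas with the lowest predicted
    # future-access value per byte first.
    victims = sorted(cache, key=lambda f: pred(f) / cache[f])
    freed, evict = 0, []
    for f in victims:
        if used - freed + new_size <= capacity:
            break
        evict.append(f)
        freed += cache[f]
    if used - freed + new_size > capacity:
        return False  # file larger than the whole store
    # Admit only if the newcomer's predicted value exceeds the total
    # predicted value of everything that would be evicted.
    if pred(new_file) > sum(pred(f) for f in evict):
        for f in evict:
            del cache[f]
        cache[new_file] = new_size
        return True
    return False
```

Under this admission rule, a cold new replica never displaces hot resident files, which is the intuition behind minimizing the data missing rate when storage is scarce.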

Simulator and configuration

We evaluate the performance of our MinDmr replica and replacement strategy using OptorSim v2.0.0, which was developed by the EU DataGrid Project [5] to test dynamic replica schemes. The designers of OptorSim state that it was developed to “model the interactions of the individual Grid components of a running Data Grid as realistically as possible” [1]. As a result, we chose OptorSim in order to test our proposed strategies on the most realistic testbed currently available. OptorSim
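The exact workload configuration is not reproduced in this snippet, but OptorSim-style studies (including the Zipf economical scheme cited in the introduction) commonly assume Zipf-like file popularity, where a few files are hot and most are cold. A minimal stdlib sketch of generating such a synthetic request stream; the function name and parameters are illustrative, not taken from the paper or from OptorSim:

```python
# Hypothetical workload generator: a Zipf-like request stream of the
# kind often used to drive replica-optimizer simulations. The rank-r
# file is requested with weight 1/r**alpha.
import random

def zipf_requests(num_files, num_requests, alpha=1.0, seed=42):
    rng = random.Random(seed)  # fixed seed for a repeatable experiment
    weights = [1.0 / rank ** alpha for rank in range((1), num_files + 1)]
    files = [f"file{rank}" for rank in range(1, num_files + 1)]
    return rng.choices(files, weights=weights, k=num_requests)

stream = zipf_requests(num_files=100, num_requests=10_000)
# The rank-1 file should dominate a mid-ranked file in the stream.
print(stream.count("file1") > stream.count("file50"))
```

Feeding such a skewed stream to a replacement policy is what makes prediction-based schemes distinguishable from LRU/LFU baselines in simulation.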

Conclusions and future work

This paper studies the data availability problem for a Data Grid system with storage constraints. We have presented two metrics to evaluate the availability of the system data. We then discussed how we model the system availability problem and how to transfer this to a classic optimal problem. For on-line replication that can be close to the optimal solution, with the assumption that the Grid storage space is limited, we present our MinDmr replica greedy optimizer algorithm that treats hot and

Ming Lei is a Ph.D. candidate in Computer Science at The University of Alabama. He received a B.A. from Southwest Jiaotong University, China in 2000. His research interests include real-time database systems, networking security, high-performance grid computing and mobile data management.

References (24)

  • Ming Lei, S. Vrbsky, A data replication strategy to increase availability in Data Grids, in: Grid Computing and...
  • T.E. Ng, H. Zhang, Predicting internet network distance with coordinates-based approaches, in: 21st IEEE INFOCOM...


Susan V. Vrbsky is an Associate Professor of Computer Science at The University of Alabama. Dr. Vrbsky received her Ph.D. in Computer Science from The University of Illinois, Urbana-Champaign. She received an M.S. in Computer Science from Southern Illinois University, Carbondale, IL and a B.A. from Northwestern University in Evanston, IL. Her research interests include real-time database systems, uncertainty and approximations, Data Grids and mobile data management.

Xiaoyan Hong is an Assistant Professor in the Department of Computer Science at The University of Alabama. She received her Ph.D. degree in Computer Science from The University of California at Los Angeles (UCLA) in 2003. Dr. Hong’s research is in the area of computer networks, covering mobile and wireless networks, vehicle networks, wireless sensor networks and grid computing networks. Her current research focuses on scalable routing, mobility modeling, privacy and security of wireless networks and resource allocation optimizations.

Earlier versions of this paper were presented at the Third International Workshop on Networks for Grid Applications, GridNets 2006, held in San José, California, USA, October 1–2, 2006.
