1 Introduction

Data allocation problem (DAP) in distributed database systems is a NP-hard optimization problem with great importance in distributed environments. In DAP problem the main objective is to assign set of fragments to set of sites in order to minimize the total cost of transactions. The main cost in the system comes from data transmission across the distributed system. Therefore, the fragment allocation affects and increases the overall cost [1]. In real world, it is obvious that huge amount of data is used by different types of sites e.g. mail servers and search engines [2]. Hence, the way of data allocation to the sites is of remarkable importance in reducing the total cost. Detailed description of DAP problem is presented in next sections.

The optimization problems are significant challenge in majority of engineering applications in which researchers have been extensively solving DAP as challenging problem by applying soft computing methods especially evolutionary algorithms due to their capability in extracting good solutions in an acceptable computational time. Two types of algorithms namely static and dynamic algorithms have been applied to solve DAP problem [3, 4]. The literature review and state-of-the-art in solving DAP problem is given in following sections.

Recently, the hybrid methods have been proposed for solving optimization problems because they increase the search efficiency by combining the abilities of different methods [5,6,7]. This study proposes an innovative hybrid method, named as DEVNS from this point on, for solving data allocation problem (DAP). The suggested method consists of differential evolution (DE) algorithm [8,9,10] with efficient operators and variable neighborhood search technique (VNS) [11, 12].

Differential evolution (DE) is a population-based algorithm for solving optimization problems. DE attempts to improve candidate solutions iteratively regarding to their fitness values. These kinds of algorithms are called as metaheuristics in which they are able to discover huge search spaces in acceptable computational time. Also, they don’t need much information about the problem being solved. DE like all other metaheuristics does not guarantee to find optimal solution but it hopefully can find feasible and good solutions for optimization problems. DE works on population of solutions and generates new solutions by combining solutions selected from population. The new generated solutions are evaluated in terms of fitness to decide on adding them to the population. If new solutions have higher quality, they are added to the population, otherwise they are discarded. This way, all solutions in population moves over the search space to discover better positions than previous positions [8,9,10].

Variable neighborhood search (VNS) is also a single-solution based metaheuristic for solving optimization problems. VNS discovers neighbors of candidate solution in far distance over search space then carry out local search around them. In case of finding better solution in terms of fitness, VNS moves there and forgets the previous solution. VNS is used to solve discrete and continuous type of linear and non-line optimization problems [11, 12].

According to the proposed hybrid strategy, DE algorithm is modified with more efficient selection and crossover operators. Likewise, an efficient neighborhood search mechanism is applied to a solution instead of mutation operator in each iteration. In the modified DE, instead of applying fully random-based selection operator, the solutions are selected still randomly but based on their fitness using roulette wheel selection approach. This way, the better solutions in terms of fitness will have more chance to be selected. However, the roulette wheel selection (RWS) operator will give a small chance for worse solutions to be selected. In selection phase, for each solution in population 3 solutions are chosen using RWS. Later on a solution is generated using crossover operator applied on selected solutions. Thereafter VNS algorithm is applied on generated solution and if it becomes better than solution in population, replacement is done.

Variable neighborhood search (VNS) technique is applied over generated solution in each iterations to do more exploitation. This way, generated solution is adjusted and becomes more accurate. The flowchart for VNS search method is presented in following sections. In order to prevent time consuming VNS, the inner loop is iterated 10 times. Neighborhood structure N in VNS method is defined in a way that it makes somewhat big modification on the solution. For that, S sites are selected and modified randomly. The details of operators and VNS method are given in the following sections.

Performance of proposed DEVNS is experimentally evaluated against nine state-of-the-art methods using well-known benchmarks reported in literature. Obtained results exhibits that proposed DEVNS takes the first position for 13 of 20 test problems. Likewise, Friedman aligned rank test is carried out to demonstrate that there is significant statistical difference between all methods.

The rest of the paper is formed as following: The detailed description of Data Allocation problem is given in Sect. 2. Section 3 presents the state-of-the-art methods for solving the DAP problem. The proposed novel hybrid method (DEVNS) is expressed in Sect. 4. Section 5 demonstrates the algorithm parameters, statistical analysis and comparison results. Finally, the conclusions and some future research works are given in Sect. 6.

2 Data allocation problem in distributed database systems

This section describes the data allocation problem (DAP) in distributed database systems in details. In a distributed database system including a number of sites, each site is responsible to support a part of database [13]. In such a system, a big amount of data transmission between the sites is needed due to having many transactions with diverse frequencies submitted to the sites. In DAP problem, the objective is to minimize the completion time and cost of transactions which is affected mostly by required time for data transmission due to site-fragment and inter-fragment relations [13]. Table 1 represents the description of parameter notations used in DAP problem [13]. More description about parameters can be found in [13].

Table 1 Notation description [13]

The transaction-fragment and site-transaction dependencies are indicated in Fig. 1.

Fig. 1
figure 1

Transaction-fragment and site-transaction dependencies

As already mentioned, the objective in DAP problem is to minimize the total cost. As problem constraints, the storage capacity of sites and as problem cost, the data transmission cost is taken into account. Data transmission through distributed system puts much pressure upon the system and increases the cost. Variable \(X_{ij}\) is declared as follows [13]:

$$X_{ij} = \left\{ {\begin{array}{*{20}l} 1 \hfill & {\,if\,\,\Psi_{j} = s_{i} } \\ 0 \hfill & {otherwise} \\ \end{array} } \right.$$
(1)

where \(\Psi_{j}\) is the site which \(f_{j}\) is assigned. Then the storage capacity limitation is defined as [13]:

$$\sum\limits_{j = 1}^{m} {\,fragSize_{j} \, \times \,X_{ij} \, \le \,siteCap_{i} \quad i = 1\,\,,\, \ldots ,\,\,n}$$
(2)

The total cost of data transmission is calculated as follows [13]:

$$COST\,(\Psi )\, = \,\sum\limits_{j = 1}^{m} {partialco\,st1_{{\Psi_{j} j}} \,\, + \,\,\,\sum\limits_{j1 = 1}^{m} {\,\sum\limits_{j2 = 1}^{m} {\,frdep_{j1j2} \, \times \,\,uc_{{\,\Psi_{j1} \,\Psi_{j2} }} } } \,\,}$$
(3)

where in the first part, \(partialco\,st1_{{\Psi_{j} j}}\) is the cost of storing \(f_{j}\) on site \(s_{{\Psi_{j} }}\) which is calculated as Eq. 4 [13].

$$partialco\,\,st1_{{\Psi_{j} \,j}} \, = \,\sum\limits_{q = 1}^{n} {uc_{{\,\Psi_{j} \,q}} \, \times \,stfr_{qj} }$$
(4)

where \(stfr_{qj}\) is amount of data from \(f_{j}\) accessed by \(s_{q}\) which is computed by Eq. 5 [13].

$$stfr_{qj} \, = \,\,\sum\limits_{k = 1}^{l} {freq_{qk} \, \times \,\,trfr_{kj} }$$
(5)

Also in the second part of \(COST \,(\Psi )\), the value of \(frdep_{j1j2}\) is amount of data from site of \(f_{j1}\) to the site of \(f_{j2}\) taking the indirect dependencies into account.

The direct transaction-fragment dependency (TRFR) is a matrix in which for each execution of \(t_{k}\), the value of \(trfr_{kj}\) presents amount of data to be sent from site including \(f_{j}\) to the site including \(t_{k}\). The dependency is called as direct dependency if for each execution of \(t_{k}\) there is some data sent from site of \(f_{j}\) to site of \(t_{k}\).

The value of \(frdep_{j1j2}\) is calculated as follows [13]:

$$frdep_{j1j2} = \sum\limits_{k = 1}^{l} {qfr_{kj1j2} }$$
(6)

where \(qfr_{kj1j2}\) is amount of data sending from site of \(f_{j1}\) to the site of \(f_{j2}\) which is computed as Eq. 7 [13].

$$qfr_{kj1j2} \, = \,q_{kj1j2} \, \times \,\sum\limits_{r = 1}^{n} {freq_{kr} }$$
(7)

Section 4 explains our proposed hybrid method (DEVNS) for solving data allocation problem (DAP) using the cost function mentioned in Eq. 3.

3 State-of-the-art in methods for solving data allocation problem in distributed systems

Evolutionary methods including different metaheuristics have been extensively used to solve data allocation (DAP) problem. It can be found a pretty large literature in this field. This section reviews some significant state-of-the-art in methods for solving DAP problem.

In [14], the DAP problem was modeled as knapsack problem and then solved using Artificial Immune system method [15]. Authors compared their results to other existent methods to indicate the effects of memory capacity.

Singh et al. [16] proposed an innovative method for solving fragment allocation problem in distributed systems by applying biogeography-based optimization (BBO) approach. Authors used their suggested method to minimize the total transmission and storage costs and compared obtained results to genetic algorithm results. The evaluation results represent the robustness of the proposed method in [16] compared to genetic algorithm.

Sen et al. [17] solved DAP problem by applying simulated annealing (SA) algorithm [18] and evaluated their results using standard benchmarks in CPLEX. As evaluation result, they showed that SA algorithm works better in reasonable time.

Tosun [19], applied genetic algorithm (GA), ant colony optimization (ACO) [20] and Tabu-search method [21] to solve data allocation problem. The comparison results exhibited the robustness of suggested methods using QAPLIB benchmark library. Likewise, the author proposed a new Crossover method in his research.

A novel data allocation model in constraint distributed database systems was introduced in [22] by Abdalla. The suggested method allocates fragments to the sites according to communication and update costs for the fragments. Obtained results showed that the suggested method is more successful compared to state-of-the-art methods.

Adl et al. [13] suggested a method based on ACO and local search for solving DAP problem. The obtained results indicated that proposed method reaches to near-optimal solutions in acceptable computational time. The results also proved that proposed method is scalable and flexible.

Ahmad et al. [23] introduced a dependency graph for modeling the dependency between the fragments. The authors designed an evolutionary approach for data allocation in distributed database systems. The proposed method was evaluated over standard benchmarks to indicate the robustness of method.

A novel method based on particle swarm optimization (PSO) was proposed by Mahi et al. [2] for solving data allocation problem (DAP). The proposed method is appropriate for minimizing the total transmission cost. Suggested method, PSO-DAP, was evaluated over 20 different benchmarks and compared to existent methods in literature. Obtained results confirmed that PSO-DAP extracts solutions with higher quality rather than its competitors.

Bhuyar et al. [24] presented an introduction to distributed database systems in terms of fragmentation and allocation. According to [24], data fragmentation and allocation are two important issues in distributed database which are NP-hard problems. Authors mentioned that fragment allocation is critical issue for improving the performance of applications in distributed database systems.

In Sewisy et al. [25] developed a heuristic query-driven clustering-based vertical fragmentation technique which focuses on generates clusters. Later on they provide intended disjoint fragments using clusters. Experimental results demonstrated that proposed method is effective and valid.

Abdalla et al. [26] aimed to enhance transmission cost to improve distribution performance. Authors improved data fragmentation and allocation methods as well as they carried out site clustering to produce minimum possible number of clusters. The performance of proposed method was evaluated using TC objective function and results proved the efficiency of suggested method.

In Apers et al. [27] proposed a method to minimize total transmission cost in which it carry out fragment allocation. Likewise, authors investigated the complexity of data allocation problem. Also they presented and compared methods to achieve optimal solutions.

4 The proposed hybrid method (DEVNS) for solving data allocation problem (DAP) in distributed database systems

This section describes the proposed innovative hybrid method (DEVNS) comprising modified differential evolution (DE) algorithm [8,9,10] and variable neighborhood search (VNS) technique [11, 12] for solving data allocation problem (DAP). The proposed DEVNS uses a modified DE in terms of selection, crossover and mutation operators. Figure 2 demonstrates the flowchart of proposed DEVNS for solving DAP problem.

Fig. 2
figure 2

Proposed hybrid method (DEVNS)

The flowchart in Fig. 2 starts with initialization of population by random. In the proposed DEVNS, a solution (allocation) is represented as a one dimensional array with m columns, where m is the number of fragments in distributed database system. In this representation, fragments and sites are shown by array indexes and array contents respectively. For instance, a sample allocation for 20 fragments and 4 sites is represented in Fig. 3.

Fig. 3
figure 3

A sample solution representation for 20 fragments and 4 sites

In the next step, population is initialized randomly using the algorithm shown in Fig. 4. Population is a two dimensional array in which the number of solutions and number of fragments are PopSize and m respectively. Population initialization algorithm in Fig. 4 fills this array with numbers between 1 and number of sites.

Fig. 4
figure 4

Population initialization

Later on, the system works in consecutive sessions until the termination criteria are met. In the first step of session, the system calculates the total cost values according to the descriptions and equations given in Sect. 2. The algorithm for computing the total cost value for a solution is given in Fig. 5.

Fig. 5
figure 5

Total cost value calculation

Afterwards, the hybrid system (DEVNS) has an inner loop to do the exploration task using differential evolution algorithm. The loop repeats the following steps for each solution i in the population. In the first step, three different solutions are selected using roulette wheel selection by taking the total cost values into account. The algorithm is given in Fig. 2. Thereafter in three consecutive steps the crossover operator is carried out over randomly selected solutions to find out the final solution C. Crossover operator is carried out in a way that it generates feasible solutions. For this, the two-point crossover is applied to combine the sites allocated to fragments. Figure 6 indicates the crossover algorithm.

Fig. 6
figure 6

Two-point crossover

Afterwards, variable neighborhood search (VNS) technique is applied over solution C to do more exploration and exploitation. This way, solution C is adjusted and becomes more accurate. Figure 7 indicates the flowchart for VNS search method. In order to prevent time consuming VNS, the inner loop is iterated 10 times. The neighborhood structure in VNS method is defined in a way that it makes somehow big modification on the solution. To do that, more sites are changed randomly to discover new solutions in far distance over search space. This way, it would be possible to extract higher quality solutions through whole space.

Fig. 7
figure 7

Variable neighborhood search method (VNS)

Once the inner loop is terminated, DEVNS continues with the next session if the termination criteria are not met.

5 Evaluation and experimental results

This section demonstrates the evaluation results of applying proposed DEVNS over well-known benchmarks reported in state-of-the-art literature [1, 2]. The proposed method is tested using the same data set used by state-of-the-art methods. Benchmark data sets are generated using the rules and equations defined in Sect. 2 and according to experimental environment described in [13]. It should be noticed that unit cost is chosen in rage of [0, 1]. Table 2 presents the values of all parameters related to the DEVNS [2, 13]. DEVNS has been implemented in Matlab® programming language environment and executed on computer with CPU 2.00 GHz, memory 2 GB and 32-bit Windows 10 operating system. In the evaluation process, the number of generations is taken into account less than values reported in related literature [1, 2]

Table 2 Parameter values used in proposed DEVNS

According to state-of-the-art literature [1, 2], the number of fragments and sites are assumed to be equal by all algorithms. The state-of-the-art literature to be compared to proposed DEVNS are PSO-DAP [2], SA [28], ACO [1], RTS [1], GA [1] and HG-MTS [1]. Table 3 represents the cost comparison of methods over different DAP instance sizes. Results of best performing algorithms are indicated in bold.

Table 3 Cost comparison of methods for different DAP instance sizes (cost value is column × \(10^{6}\))

It can be seen over Table 3 that only PSO-DAP and HG-MTS methods are better than DEVNS in 2 and 5 of 20 problem instances respectively. The proposed DEVNS takes the first position for 13 of 20 test problems. Likewise, comparisons between PSO-DAP and DEVNS in the Table 3 indicates that mostly for the very small problem sizes, the PSO-DAP performs better, but the difference between obtained results is very small. When the problem size is growing, DEVNS performs much better than all competitors and PSO-DAP.

The last step of experimental evaluations is to perform the friedman aligned rank test. The rank test is performed over the all cost values obtained by all 9 methods presented in Table 3 and has two goals. The first goal is to check the similarity of DEVNS results to other 8 methods results. The second objective is to determine the order of DEVNS among all 9 methods. Friedman aligned rank test is performed according to the computational procedure explained in [29, 30]. Firedman aligned ranks for all 9 methods are presented in Table 4. This table indicates the ranks for all problem-method pairs.

Table 4 Friedman aligned ranks for all problem-method pairs

The friedman aligned rank test statistic, FAR, is computed according to the formula described in [29] with statistical significance of \(\chi^{2}\) distribution and k-1 degrees of freedom where k is the number of methods. Meanwhile, the computed p value illustrates the significant differences between all methods under consideration. Eventually, the Table 5 indicates the average rank values, FAR and p value for all methods. It can be seen that average rank value for DEVNS is the smallest one which shows that DEVNS is the best performing method. For the same reason, the PSO-DAP is the second best performing method. Likewise, the p value is very close to zero that demonstrates that there is significant statistical difference between all methods. Meanwhile, a very small p value means that DEVNS is statistically different than other 8 methods.

Table 5 Average Friedman aligned ranks, FAR and p value

6 Conclusions and future works

This paper proposes an innovative hybrid method (DEVNS) for solving data allocation problem in distributed database systems. The method works based on a strategy to combine differential evolution (DE) algorithm and variable neighborhood search (VNS) technique. To have an effective method, DE method is modified with effective selection, crossover and mutation operators. Selection operator is carried out based on objective values instead of random selections to increase the chance of better solutions to be selected. Moreover, the variable neighborhood search technique is added to the algorithm instead of mutation operator to provide more exploration and exploitation power and extract more accurate solutions. Obtained results demonstrate that proposed hybrid method outperforms all state-of-the-art methods reported in literature. Future works are planned to use the proposed method with better strategies over different problems.