Introduction
-
Firstly, we propose a MapReduce framework that promotes parallel and distributed computing of shortest path in large-scale graph with A* algorithm.
-
Secondly, our experimental analysis proves that the MapReduce version of A* outperforms the direct resolution approach of A*, and significantly reduce the time complexity with high quality results. In addition, our framework is reliable on real road network and works well with the scalability of network size.
Related work
Sequential shortest paths computing
Distributed shortest paths computing
Background
Hadoop and MapReduce
A* algorithm
-
V is the finite set of vertices of the graph G.
-
E is the set of edges, such as: if \((v, u) \in E\), then there is an edge between the vertices v and u.
-
We define the length function \(l{:V} \times V \mapsto R^+\), which for each edge (v, u), we associate a length l(v, u) if there is an edge between v and u, else \(\infty \) if there is no edge.
-
For each vertex \(v \in V\), we define a distance \(d{:V}\mapsto R^+\) such as \(d(v)= \infty \) if we cannot reach the goal vertex from v.
-
Step 1 initializationset \(O = \emptyset \) and \(S = \emptyset \);begin by setting \(g(v) = \infty \) for each vertex \(v \in V\);next set current vertex \(c = s\), \(g(s) = 0\) and \(d(s) = h(s)\);finally set \(c = s\) and let \(S = \{s\}\);
-
Step 2 vertex expandingfor each vertex \(v \in V\) where edge \((c ,v) \in E\); if \(g(v) > g(c) + l(c, v)\) then update \(g(v) = g(c) + l(c, v)\); set \(d(v) = g(v) + h(v)\), set \(d(v) = g(v) + h(v)\) and when \(v \notin O\) let \(O = O+\{v\}\);
-
Step 3 selection of promising vertex \(v^*\)identify vertex \(v^*\in O\) where \(d(v^*) \le d(v)\) for all \(v \in O\); set \(O = O-\{v^*\}\) and \(S = S+\{v^*\}\);set \(c = v^*\);
-
Step 4 stopping criteriaif \(c = e\) then the path has been found;elseif \(O = \emptyset \) then failure;otherwise go to step 2.
Proposed MapReduce version of A*
Input stage: partition of the initial graph
-
l: subgraph length in km;
-
A: source vertex;
-
E: target vertex.
-
d: diagonal length in km;
-
i: ith subgraph;
-
(A.lon, A.lat): GPS coordinates of the source vertex (point A);
-
(E.lon, E.lat): GPS coordinates of the target vertex (point E).
Map stage: computation of intermediate paths
Reduce stage: concatenation of intermediate paths
Output Stage: Storage of full path
Experimental results
Parameter type | Parameter designation | Parameter value |
---|---|---|
\(N_{node}\)
| No. of cluster nodes | 6 |
n
| No. of graph vertices | [8000, 100,000] |
\(G_{size}\)
| Graph size in Gbit | [0.2, 16.6] |
\(B_{size}\)
| Block size in Mbit | {64, 128, 256} |
\(G'_{size}\)
| Subgraph size in Gbit | [0.3, 2] |
l
| Subgraph length in km | [20, 400] |
\(N^{map}_{core}\)
| No. of map cores | {1, 2, 3, 4} |
\(N^{red}_{core}\)
| No. of reduce cores | {1, 2, 3, 4} |
Data set
Application: road trip from northern to southern Morocco
Subgraph | Starting point | Ending point | ||||
---|---|---|---|---|---|---|
Name | Lat | lon | Name | Lat | lon | |
1 | Tangier | 35.759 | − 5.818 | El Gara | 33.24 | − 7.15 |
2 | El Gara | 33.24 | − 7.15 | Ait M’Hamed | 31.857 | − 6.498 |
3 | Ait M’Hamed | 31.857 | − 6.498 | Ikiafene | 29.663 | − 9.638 |
4 | Ikiafene | 29.663 | − 9.638 | Boukraa | 26.341 | − 12.841 |
5 | Boukraa | 26.341 | − 12.841 | Dahkla | 23.03 | − 15.02 |
Ratio of MapReduce-A* efficiency versus direct resolution
n
| No. data |
\(G_{size}\)
| Time with A* | Time with MRA* | Ratio | ||
---|---|---|---|---|---|---|---|
\(T_{A^*}\)
|
\(t_{map}\)
|
\(t_{red}\)
|
\(T_{MRA^*}\)
|
\(\frac{T_{A^*_{ }}}{T_{MRA^*}}\)
| |||
8000 | 64 × 106 | 0.25 | 1,5752 | 157 | 11 | 168 | 94 |
15,000 | 22.5 × 107 | 0.5 | 35,331 | 183 | 18 | 201 | 159 |
20,000 | 40 × 107 | 0.8 | 51,469 | 233 | 36 | 269 | 191 |
25,000 | 62.5 × 107 | 1.2 | 77,378 | 330 | 53 | 366 | 211 |
30,000 | 90 × 107 | 2 | 101,909 | 405 | 76 | 481 | 212 |
40,000 | 16 × 108 | 3 | 129,343 | 698 | 91 | 789 | 164 |
50,000 | 25 × 108 | 5 | 152,863 | 1035 | 103 | 1137 | 134 |
60,000 | 36 × 108 | 7 | 177,948 | 1397 | 203 | 1600 | 111 |
80,000 | 64 × 108 | 11.5 | 269,102 | 1790 | 408 | 2198 | 122 |
90,000 | 81 × 108 | 14 | 31,4425 | 2100 | 568 | 2668 | 118 |
100,000 | 10 × 109 | 16.6 | 359,747 | 2257 | 682 | 2939 | 122 |
Average ratio of time improvement |
149
|
Influence of number of core processors on the computational time
Influence of number of Hadoop nodes on the computational time
Influence of number of subgraphs on the computational time and the result quality
Influence of blocks size and subgraphs length on the computational time
l
|
\(G'_{size}\)
| Total time \(T_{MRA^*}\) | ||
---|---|---|---|---|
\(B_{size} = 64\)
|
\(B_{size} = 128\)
|
\(B_{size} = 256\)
| ||
60 | 0.3 | 371 | 376 | 368 |
78 | 0.4 | 412 | 409 | 409 |
100 | 0.5 | 463 | 464 | 464 |
109 | 0.6 | 460 | 517 | 532 |
125 | 0.7 | 458 | 596 | 605 |
156 | 0.8 | 459 | 753 | 754 |
200 | 1 | 461 | 1003 | 1065 |
265 | 1.3 | 460 | 1008 | 1149 |
400 | 2 | 461 | 1004 | 1164 |
Average time |
445
|
681
|
726
|
MapReduce-A* versus MapReduce-Dijkstra
Discussion
Conclusion and further works
-
Adapting the traveling salesman problem to such framework.
-
Proposing a novel framework based on Big Data graph analysis tools such as Neo4j or Apache Shindig to better explore the road network graph.