Skip to main content
Top
Published in: Cluster Computing 3/2015

01-09-2015

Large scale graph processing systems: survey and an experimental evaluation

Authors: Omar Batarfi, Radwa El Shawi, Ayman G. Fayoumi, Reza Nouri, Seyed-Mehdi-Reza Beheshti, Ahmed Barnawi, Sherif Sakr

Published in: Cluster Computing | Issue 3/2015

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Graph is a fundamental data structure that captures relationships between different data entities. In practice, graphs are widely used for modeling complicated data in different application domains such as social networks, protein networks, transportation networks, bibliographical networks, knowledge bases and many more. Currently, graphs with millions and billions of nodes and edges have become very common. In principle, graph analytics is an important big data discovery technique. Therefore, with the increasing abundance of large graphs, designing scalable systems for processing and analyzing large scale graphs has become one of the most timely problems facing the big data research community. In general, scalable processing of big graphs is a challenging task due to their size and the inherent irregular structure of graph computations. Thus, in recent years, we have witnessed an unprecedented interest in building big graph processing systems that attempted to tackle these challenges. In this article, we provide a comprehensive survey over the state-of-the-art of large scale graph processing platforms. In addition, we present an extensive experimental study of five popular systems in this domain, namely, GraphChi, Apache Giraph, GPS, GraphLab and GraphX. In particular, we report and analyze the performance characteristics of these systems using five common graph processing algorithms and seven large graph datasets. Finally, we identify a set of the current open research challenges and discuss some promising directions for future research in the domain of large scale graph processing.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
28
Please refer to a table with specification of EC2 instances on http://​aws.​amazon.​com/​ec2/​instance-types/​.
 
Literature
1.
go back to reference Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Rasin, A., Silberschatz, A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB 2(1), 922–933 (2009) Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Rasin, A., Silberschatz, A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB 2(1), 922–933 (2009)
2.
go back to reference Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J., Hueske, F., Heise, A., Kao, O., Leich, M., Leser, U., Markl, V., Naumann, F., Peters, M., Rheinländer, A., Sax, M.J., Schelter, S., Höger, M., Tzoumas, K., Warneke, D.: The stratosphere platform for big data analytics. VLDB J. 23(6), 939–964 (2014)CrossRef Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J., Hueske, F., Heise, A., Kao, O., Leich, M., Leser, U., Markl, V., Naumann, F., Peters, M., Rheinländer, A., Sax, M.J., Schelter, S., Höger, M., Tzoumas, K., Warneke, D.: The stratosphere platform for big data analytics. VLDB J. 23(6), 939–964 (2014)CrossRef
3.
go back to reference Barnawi, A., Batarfi, O., Elshawi, R., Fayoumi, A., Nouri, R., Sakr, S.: On characterizing the performance of distributed graph computation platforms. In: Proceedings of the TPC Technology Conference, TPCTC. Springer, Berlin (2014) Barnawi, A., Batarfi, O., Elshawi, R., Fayoumi, A., Nouri, R., Sakr, S.: On characterizing the performance of distributed graph computation platforms. In: Proceedings of the TPC Technology Conference, TPCTC. Springer, Berlin (2014)
4.
go back to reference Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: a flexible and extensible foundation for data-intensive computing. In: Proceedings of the international conference on Data Engineering, ICDE, pp. 1151–1162. IEEE (2011) Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: a flexible and extensible foundation for data-intensive computing. In: Proceedings of the international conference on Data Engineering, ICDE, pp. 1151–1162. IEEE (2011)
5.
go back to reference Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: The HaLoop approach to large-scale iterative data analysis. VLDB J. 21(2), 169–190 (2012)CrossRef Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: The HaLoop approach to large-scale iterative data analysis. VLDB J. 21(2), 169–190 (2012)CrossRef
6.
go back to reference Bu, Y., Borkar, V.R., Jia, J., Carey, M.J., Condie, T.: Pregelix: Big(ger) graph analytics on a dataflow engine. PVLDB 8(2), 161–172 (2014) Bu, Y., Borkar, V.R., Jia, J., Carey, M.J., Condie, T.: Pregelix: Big(ger) graph analytics on a dataflow engine. PVLDB 8(2), 161–172 (2014)
7.
go back to reference Chattopadhyay, B., Lin, L., Liu, W., Mittal, S., Aragonda, P., Lychagina, V., Kwon, Y., Wong, M.: Tenzing a SQL implementation on the MapReduce framework. PVLDB 4(12), 1318–1327 (2011) Chattopadhyay, B., Lin, L., Liu, W., Mittal, S., Aragonda, P., Lychagina, V., Kwon, Y., Wong, M.: Tenzing a SQL implementation on the MapReduce framework. PVLDB 4(12), 1318–1327 (2011)
8.
go back to reference Chen, R., Weng, X., He, B., Yang, M.: Large graph processing in the cloud. In: Proceedings of the SIGMOD, pp. 1123–1126. ACM (2010) Chen, R., Weng, X., He, B., Yang, M.: Large graph processing in the cloud. In: Proceedings of the SIGMOD, pp. 1123–1126. ACM (2010)
9.
go back to reference Clinger, W.D.: Foundations of Actor Semantics. Technical Report, Cambridge, MA (1981) Clinger, W.D.: Foundations of Actor Semantics. Technical Report, Cambridge, MA (1981)
10.
go back to reference Dean, J., Ghemawa, S.: MapReduce: simplified data processing on large clusters. OSDI 1, 137–150 (2004) Dean, J., Ghemawa, S.: MapReduce: simplified data processing on large clusters. OSDI 1, 137–150 (2004)
11.
go back to reference Ediger, D., Bader, D.A.: Investigating graph algorithms in the BSP model on the cray XMT. In: Proceedings of the IPDPS workshops (2013) Ediger, D., Bader, D.A.: Investigating graph algorithms in the BSP model on the cray XMT. In: Proceedings of the IPDPS workshops (2013)
12.
go back to reference Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., Fox, G.: Twister: a runtime for iterative MapReduce. In: Proceedings of the High Performance Distributed Computing, HPDC, pp. 810–818. ACM (2010) Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., Fox, G.: Twister: a runtime for iterative MapReduce. In: Proceedings of the High Performance Distributed Computing, HPDC, pp. 810–818. ACM (2010)
13.
go back to reference Fard, A., Nisar, M.U., Ramaswamy, L., Miller, J.A., Saltz, M.: A distributed vertex-centric approach for pattern matching in massive graphs. In: Proceedings of the BigData conference, pp. 403–411 (2013) Fard, A., Nisar, M.U., Ramaswamy, L., Miller, J.A., Saltz, M.: A distributed vertex-centric approach for pattern matching in massive graphs. In: Proceedings of the BigData conference, pp. 403–411 (2013)
14.
go back to reference Friedman, E., Pawlowski, P.M., Cieslewicz, J.: SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB 2(2), 1402–1413 (2009) Friedman, E., Pawlowski, P.M., Cieslewicz, J.: SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB 2(2), 1402–1413 (2009)
15.
go back to reference Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: PowerGraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the Operating Systems Design and Implementation, OSDI, pp. 17–30 (2012) Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: PowerGraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the Operating Systems Design and Implementation, OSDI, pp. 17–30 (2012)
16.
go back to reference Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I.: GraphX: graph processing in a distributed dataflow framework. In: Proceedings of the OSDI, pp. 599–613 (2014) Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I.: GraphX: graph processing in a distributed dataflow framework. In: Proceedings of the OSDI, pp. 599–613 (2014)
17.
go back to reference Guo, Y., Biczak, M., Varbanescu, A.L., Iosup, A., Martella, C., Willke, T.L.: How well do graph-processing platforms perform? An empirical performance evaluation and analysis. In: Proceedings of the International Parallel and Distributed Processing Symposiumm, IPDPS, pp. 395–404 (2014) Guo, Y., Biczak, M., Varbanescu, A.L., Iosup, A., Martella, C., Willke, T.L.: How well do graph-processing platforms perform? An empirical performance evaluation and analysis. In: Proceedings of the International Parallel and Distributed Processing Symposiumm, IPDPS, pp. 395–404 (2014)
18.
go back to reference Guo, Y., Varbanescu, A.L., Iosup, A., Martella, C., Willke, T.L.: Benchmarking graph-processing platforms: a vision. In: Proceedings of the International Conference on Performance Engineering, ICPE, pp. 289–292 (2014) Guo, Y., Varbanescu, A.L., Iosup, A., Martella, C., Willke, T.L.: Benchmarking graph-processing platforms: a vision. In: Proceedings of the International Conference on Performance Engineering, ICPE, pp. 289–292 (2014)
19.
go back to reference Han, W., Lee, S., Park, K., Lee, J., Kim, M., Kim, J., Yu, H.: TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC. In: Proceedings of the KDD, pp. 77–85 (2013) Han, W., Lee, S., Park, K., Lee, J., Kim, M., Kim, J., Yu, H.: TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC. In: Proceedings of the KDD, pp. 77–85 (2013)
20.
go back to reference Han, M., Daudjee, K., Ammar, K., Özsu, M.T., Wang, X., Jin, T.: An experimental comparison of Pregel-like graph processing systems. PVLDB 7(12), 1047–1058 (2014) Han, M., Daudjee, K., Ammar, K., Özsu, M.T., Wang, X., Jin, T.: An experimental comparison of Pregel-like graph processing systems. PVLDB 7(12), 1047–1058 (2014)
21.
go back to reference Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: a Self-tuning system for big data analytics. In: Proceedings of the Conference on Innovative Data Systems Research, CIDR, pp. 261–272 (2011) Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: a Self-tuning system for big data analytics. In: Proceedings of the Conference on Innovative Data Systems Research, CIDR, pp. 261–272 (2011)
22.
go back to reference Kang, U., Tsourakakis, C.E., Faloutsos, C.: PEGASUS: a peta-scale graph mining system. In: Proceedings of the International Conference on Data Mining, ICDM, pp. 229–238 (2009) Kang, U., Tsourakakis, C.E., Faloutsos, C.: PEGASUS: a peta-scale graph mining system. In: Proceedings of the International Conference on Data Mining, ICDM, pp. 229–238 (2009)
23.
go back to reference Kang, U., Meeder, B., Faloutsos, C.: Spectral analysis for billion-scale graphs: discoveries and implementation. In: Proceedings of the PAKDD, pp. 13–25 (2011) Kang, U., Meeder, B., Faloutsos, C.: Spectral analysis for billion-scale graphs: discoveries and implementation. In: Proceedings of the PAKDD, pp. 13–25 (2011)
24.
go back to reference Kang, U., Tsourakakis, C.E., Faloutsos, C.: PEGASUS: mining peta-scale graphs. Knowl. Inf. Syst. 27(2), 303–325 (2011)CrossRef Kang, U., Tsourakakis, C.E., Faloutsos, C.: PEGASUS: mining peta-scale graphs. Knowl. Inf. Syst. 27(2), 303–325 (2011)CrossRef
25.
go back to reference Kang, U., Tong, H., Sun, J., Lin, C.-Y., Faloutsos, C.: GBASE: a scalable and general graph management system. In: Proceedings of the international conference on Knowledge Discovery and Data Mining, KDD, pp. 1091–1099 (2011) Kang, U., Tong, H., Sun, J., Lin, C.-Y., Faloutsos, C.: GBASE: a scalable and general graph management system. In: Proceedings of the international conference on Knowledge Discovery and Data Mining, KDD, pp. 1091–1099 (2011)
26.
go back to reference Khayyat, Z., Awara, K., Alonazi, A., Jamjoom, H., Williams, D., Kalnis, P.: Mizan: a system for dynamic load balancing in large-scale graph processing. In: Proceedings of the European Conference on Computer Systems, EuroSys, pp. 169–182. ACM (2013) Khayyat, Z., Awara, K., Alonazi, A., Jamjoom, H., Williams, D., Kalnis, P.: Mizan: a system for dynamic load balancing in large-scale graph processing. In: Proceedings of the European Conference on Computer Systems, EuroSys, pp. 169–182. ACM (2013)
27.
go back to reference Kyrola, A., Blelloch, G.E., Guestrin, C.: GraphChi: large-scale graph computation on just a PC. In: Proceedings of the OSDI, pp. 31–46 (2012) Kyrola, A., Blelloch, G.E., Guestrin, C.: GraphChi: large-scale graph computation on just a PC. In: Proceedings of the OSDI, pp. 31–46 (2012)
28.
go back to reference Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning in the cloud. PVLDB 5(8), 716–727 (2012) Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning in the cloud. PVLDB 5(8), 716–727 (2012)
29.
go back to reference Lu, Y., Cheng, J., Yan, D., Wu, H.: Largescale distributed graph computing systems: an experimental evaluation. PVLD 8(3), 281–292 (2014) Lu, Y., Cheng, J., Yan, D., Wu, H.: Largescale distributed graph computing systems: an experimental evaluation. PVLD 8(3), 281–292 (2014)
30.
go back to reference Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Proceedings of the SIGMOD conference, pp. 135–146 (2010) Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Proceedings of the SIGMOD conference, pp. 135–146 (2010)
31.
go back to reference Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. Technical Report 1999–66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120 Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. Technical Report 1999–66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120
32.
go back to reference Sakr, S.: GraphREL: a decomposition-based and selectivity-aware relational framework for processing sub-graph queries. In: Proceedings of the DASFAA, pp. 123–137 (2009) Sakr, S.: GraphREL: a decomposition-based and selectivity-aware relational framework for processing sub-graph queries. In: Proceedings of the DASFAA, pp. 123–137 (2009)
33.
go back to reference Sakr, S., Al-Naymat, G.: Efficient relational techniques for processing graph queries. J. Comput. Sci. Technol. 25(6), 1237–1255 (2010)CrossRef Sakr, S., Al-Naymat, G.: Efficient relational techniques for processing graph queries. J. Comput. Sci. Technol. 25(6), 1237–1255 (2010)CrossRef
34.
go back to reference Sakr, S., Al-Naymat, G.: Graph indexing and querying: a review. IJWIS 6(2), 101–120 (2010) Sakr, S., Al-Naymat, G.: Graph indexing and querying: a review. IJWIS 6(2), 101–120 (2010)
35.
go back to reference Sakr, S., Pardede, E. (ed.): Graph Data Management: Techniques and Applications. IGI Global, Hershey (2011) Sakr, S., Pardede, E. (ed.): Graph Data Management: Techniques and Applications. IGI Global, Hershey (2011)
36.
go back to reference Sakr, S., Elnikety, S., He, Y.: G-SPARQL: a hybrid engine for querying large attributed graphs. In: Proceedings of the Conference on Information and Knowledge Management, CIKM (2012) Sakr, S., Elnikety, S., He, Y.: G-SPARQL: a hybrid engine for querying large attributed graphs. In: Proceedings of the Conference on Information and Knowledge Management, CIKM (2012)
37.
go back to reference Sakr, S., Liu, A., Fayoumi, A.G.: The family of mapreduce and large-scale data processing systems. ACM Comput. Surv. 46(1), 11 (2013)CrossRef Sakr, S., Liu, A., Fayoumi, A.G.: The family of mapreduce and large-scale data processing systems. ACM Comput. Surv. 46(1), 11 (2013)CrossRef
38.
go back to reference Salihoglu, S., Widom, J.: GPS: a graph processing system. In: Proceedings of the SSDBM, p. 22. ACM (2013) Salihoglu, S., Widom, J.: GPS: a graph processing system. In: Proceedings of the SSDBM, p. 22. ACM (2013)
39.
go back to reference Schad, J., Dittrich, J., Quiané-Ruiz, J.-A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. PVLDB 3(1), 460–471 (2010) Schad, J., Dittrich, J., Quiané-Ruiz, J.-A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. PVLDB 3(1), 460–471 (2010)
40.
go back to reference Shao, B., Wang, H., Li, Y.: Trinity: a distributed graph engine on a memory cloud. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 505–516 (2013) Shao, B., Wang, H., Li, Y.: Trinity: a distributed graph engine on a memory cloud. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 505–516 (2013)
41.
go back to reference Simmen, D.E., Schnaitter, K., Davis, J., He, Y., Lohariwala, S., Mysore, A., Shenoi, V., Tan, M., Xiao, Y.: Large-scale graph analytics in aster 6: bringing context to big data discovery. PVLDB 7(13), 1405–1416 (2014) Simmen, D.E., Schnaitter, K., Davis, J., He, Y., Lohariwala, S., Mysore, A., Shenoi, V., Tan, M., Xiao, Y.: Large-scale graph analytics in aster 6: bringing context to big data discovery. PVLDB 7(13), 1405–1416 (2014)
42.
go back to reference Stutz, P., Bernstein, A., Cohen, W.W.: Signal/collect: graph algorithms for the (semantic) web. Int. Semant. Web Conf. 1, 764–780 (2010) Stutz, P., Bernstein, A., Cohen, W.W.: Signal/collect: graph algorithms for the (semantic) web. Int. Semant. Web Conf. 1, 764–780 (2010)
43.
go back to reference Tian, Y., Balmin, A., Corsten, S.A., Tatikonda, S., McPherson, J.: From “think like a vertex” to “think like a graph”. PVLDB 7(3), 193–204 (2013) Tian, Y., Balmin, A., Corsten, S.A., Tatikonda, S., McPherson, J.: From “think like a vertex” to “think like a graph”. PVLDB 7(3), 193–204 (2013)
44.
go back to reference Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)CrossRef Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)CrossRef
45.
go back to reference Wang, G., Xie, W., Demers, A., Gehrke, J.: Asynchronous large-scale graph processing made easy. In CIDR (2013) Wang, G., Xie, W., Demers, A., Gehrke, J.: Asynchronous large-scale graph processing made easy. In CIDR (2013)
46.
go back to reference Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the HotCloud (2010) Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the HotCloud (2010)
47.
go back to reference Zhang, Y., Gao, Q., Gao, L., Wang, C.: iMapReduce: a distributed computing framework for iterative computation. J. Grid Comput. 10(1), 47–68 (2012)CrossRef Zhang, Y., Gao, Q., Gao, L., Wang, C.: iMapReduce: a distributed computing framework for iterative computation. J. Grid Comput. 10(1), 47–68 (2012)CrossRef
Metadata
Title
Large scale graph processing systems: survey and an experimental evaluation
Authors
Omar Batarfi
Radwa El Shawi
Ayman G. Fayoumi
Reza Nouri
Seyed-Mehdi-Reza Beheshti
Ahmed Barnawi
Sherif Sakr
Publication date
01-09-2015
Publisher
Springer US
Published in
Cluster Computing / Issue 3/2015
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-015-0472-6

Other articles of this Issue 3/2015

Cluster Computing 3/2015 Go to the issue

Premium Partner