Publication date: 01-09-2015

Large scale graph processing systems: survey and an experimental evaluation

Authors: Omar Batarfi, Radwa El Shawi, Ayman G. Fayoumi, Reza Nouri, Seyed-Mehdi-Reza Beheshti, Ahmed Barnawi, Sherif Sakr

Published in: Cluster Computing | Issue 3/2015

Abstract

The graph is a fundamental data structure that captures relationships between data entities. In practice, graphs are widely used to model complex data in application domains such as social networks, protein networks, transportation networks, bibliographical networks, and knowledge bases. Graphs with millions or even billions of nodes and edges have become very common, and graph analytics is an important big data discovery technique. With the increasing abundance of large graphs, designing scalable systems for processing and analyzing large scale graphs has therefore become one of the most timely problems facing the big data research community. In general, scalable processing of big graphs is challenging because of their size and the inherently irregular structure of graph computations. In recent years, this has driven unprecedented interest in building big graph processing systems that attempt to tackle these challenges. In this article, we provide a comprehensive survey of the state of the art in large scale graph processing platforms. In addition, we present an extensive experimental study of five popular systems in this domain: GraphChi, Apache Giraph, GPS, GraphLab and GraphX. In particular, we report and analyze the performance characteristics of these systems using five common graph processing algorithms and seven large graph datasets. Finally, we identify a set of current open research challenges and discuss some promising directions for future research in large scale graph processing.

Title: Large scale graph processing systems: survey and an experimental evaluation
Authors: Omar Batarfi, Radwa El Shawi, Ayman G. Fayoumi, Reza Nouri, Seyed-Mehdi-Reza Beheshti, Ahmed Barnawi, Sherif Sakr
Publication date: 01-09-2015
Publisher: Springer US
Published in: Cluster Computing / Issue 3/2015
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI: https://doi.org/10.1007/s10586-015-0472-6
