skip to main content
10.1145/3299869.3324959acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark

Authors Info & Claims
Published:25 June 2019Publication History

ABSTRACT

Thanks to a simple SQL extension, Recursive-aggregate-SQL (RaSQL) can express very powerful queries and declarative algorithms, such as classical graph algorithms and data mining algorithms. A novel compiler implementation allows RaSQL to map declarative queries into one basic fixpoint operator supporting aggregates in recursive queries. A fully optimized implementation of this fixpoint operator leads to superior performance, scalability and portability. Thus, our RaSQL system, which extends Spark SQL with the before-mentioned new constructs and implementation techniques, matches and often surpasses the performance of other systems, including Apache Giraph, GraphX and Myria.

References

  1. {n. d.}. Arabic-2005. http://law.di.unimi.it/webdata/arabic-2005/.Google ScholarGoogle Scholar
  2. {n. d.}. Cypher for Apache Spark. https://github.com/opencypher/ cypher-for-apache-spark.Google ScholarGoogle Scholar
  3. {n. d.}. Floyd-Warshall Algorithm. https://en.wikipedia.org/wiki/ Floyd_Warshall_algorithm.Google ScholarGoogle Scholar
  4. {n. d.}. GTgraph. http://www.cse.psu.edu/~kxm85/ software/GTgraph.Google ScholarGoogle Scholar
  5. {n. d.}. LiveJournal social network. http://snap.stanford.edu/data/ com-LiveJournal.html.Google ScholarGoogle Scholar
  6. {n. d.}. MATCH (SQL Graph). https://docs.microsoft.com/en-us/sql/ t-sql/queries/match-sql-graph.Google ScholarGoogle Scholar
  7. {n. d.}. Multi level Marketing Model. https://en.wikipedia.org/wiki/ Multilevelmarketing.Google ScholarGoogle Scholar
  8. {n. d.}. Orkut network. http://snap.stanford.edu/data/com-Orkut.html.Google ScholarGoogle Scholar
  9. {n. d.}. Presto. http://prestodb.io.Google ScholarGoogle Scholar
  10. {n. d.}. Recursion Example: Bill Of Materials. https://www.ibm.com/ support/knowledgecenter/en/SS6NHC/com.ibm.swg.im.dashdb.sql. ref.doc/doc/r0059242.html.Google ScholarGoogle Scholar
  11. {n. d.}. The RaSQL language. http://rasql.org.Google ScholarGoogle Scholar
  12. Foto N. Afrati, Vinayak R. Borkar, et al. 2011. Map-reduce extensions and recursive queries. In EDBT. 1--8. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, et al. 2015. Spark SQL: Relational Data Processing in Spark. In SIGMOD. 1383--1394.Google ScholarGoogle Scholar
  14. David F. Bacon, Nathan Bales, et al. 2017. Spanner: Becoming a SQL System. In SIGMOD 2017. 331--343. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Francois Bancilhon. 1986. Naive Evaluation of Recursively Defined Relations. In KBMS. Springer-Verlag, 165--178. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619 (2015). arXiv:1508.03619 http://arxiv.org/abs/1508.03619Google ScholarGoogle Scholar
  17. Vinayak R. Borkar, Michael J. Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE. 1151--1162. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Oscar Boykin, Sam Ritchie, Ian O'Connell, and Jimmy Lin. 2014. Summingbird: A framework for integrating batch and online mapreduce computations. PVLDB 7, 13 (2014), 1441--1451. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Yingyi Bu, Vinayak R. Borkar, Jianfeng Jia, Michael J. Carey, and Tyson Condie. 2014. Pregelix: Big(ger) Graph Analytics on a Dataflow Engine. PVLDB 8, 2 (2014), 161--172. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. 2015. One Trillion Edges: Graph Processing at Facebook-Scale. PVLDB 8, 12 (2015), 1804--1815. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Tyson Condie, Ariyam Das, Matteo Interlandi, Alexander Shkapsky, Mohan Yang, and Carlo Zaniolo. 2018. Scaling-up reasoning and advanced analytics on BigData. TPLP 18, 5--6 (2018), 806--845.Google ScholarGoogle Scholar
  22. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Nadime Francis, Alastair Green, et al. 2018. Cypher: An Evolving Query Language for Property Graphs. In SIGMOD. 1433--1445.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Filippo Furfaro, Sergio Greco, Sumit Ganguly, and Carlo Zaniolo. 2002. Pushing Extrema Aggregates to Optimize Logic Queries. Inf. Syst. 27, 5 (July 2002), 321--343. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Sumit Ganguly, Sergio Greco, and Carlo Zaniolo. 1991. Minimum and Maximum Predicates in Logic Programming. In PODS. 154--163. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Sumit Ganguly, Sergio Greco, and Carlo Zaniolo. 1995. Extrema Predicates in Deductive Databases. JCS 51, 2 (1995), 244--259. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Allen Gelder. 1993. Foundations of aggregation in deductive databases. In Deductive and Object-Oriented Databases. Springer, 13--34.Google ScholarGoogle Scholar
  28. Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In USENIX, OSDI. 17--30.Google ScholarGoogle Scholar
  29. Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework. In OSDI. 599--613.Google ScholarGoogle Scholar
  30. Goetz Graefe and William McKenna. 1991. The Volcano optimizer generator. Technical Report. Colorado Univ at Boulder.Google ScholarGoogle Scholar
  31. Huahai He and Ambuj K. Singh. 2008. Graphs-at-a-time: query language and access methods for graph databases. In SIGMOD. 405--418. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. David B. Kemp and Peter J. Stuckey. 1991. Semantics of Logic Programs with Aggregates. In ISLP. 387--401.Google ScholarGoogle Scholar
  33. Jay Kreps, Neha Narkhede, Jun Rao, et al. 2011. Kafka: A distributed messaging system for log processing.Google ScholarGoogle Scholar
  34. Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue B. Moon. 2010. What is Twitter, a social network or a news media?. In WWW, Raleigh, North Carolina, USA, April 26--30, 2010. 591--600.Google ScholarGoogle Scholar
  35. Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB 5, 8 (2012), 716--727. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Yi Lu, James Cheng, Da Yan, and Huanhuan Wu. 2014. Large-Scale Distributed Graph Computing Systems: An Experimental Evaluation. PVLDB 8, 3 (2014), 281--292. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A System for Large-scale Graph Processing. In SIGMOD. 135--146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Mirjana Mazuran, Edoardo Serra, and Carlo Zaniolo. 2013. Extending the Power of Datalog Recursion. VLDB J. 22, 4 (2013), 471--493. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Frank McSherry, Michael Isard, and Derek G Murray. 2015. Scalability! But at what COST?. In 15th Workshop on Hot Topics in Operating Systems (HotOS {XV}). Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Frank McSherry, Derek Gordon Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential Dataflow. In CIDR.Google ScholarGoogle Scholar
  41. Svilen R. Mihaylov, Zachary G. Ives, and Sudipto Guha. 2012. REX: Recursive, Delta-based Data-centric Computation. PVLDB 5, 11 (July 2012), 1280--1291.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Jack Minker, Dietmar Seipel, and Carlo Zaniolo. 2014. Logic and Databases: A History of Deductive Databases. In Computational Logic. 571--627.Google ScholarGoogle Scholar
  43. Inderpal Singh Mumick, Hamid Pirahesh, and Raghu Ramakrishnan. 1990. The Magic of Duplicates and Aggregates. In PVLDB. 264--277. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In SOSP. 439--455. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Thomas Neumann. 2011. Efficiently compiling efficient query plans for modern hardware. PVLDB 4, 9 (2011), 539--550. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Kian Win Ong, Yannis Papakonstantinou, and Romain Vernoux. 2014. The SQL++ Semi-structured Data Model and Query Language. CoRR abs/1405.3631 (2014).Google ScholarGoogle Scholar
  47. Jorge Pérez, Marcelo Arenas, and Claudio Gutiérrez. 2009. Semantics and complexity of SPARQL. Database Syst. 34, 3 (2009), 16:1--16:45. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Kenneth A Ross and Yehoshua Sagiv. 1992. Monotonic aggregation in deductive databases. In PODS. 114--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Ariyam Das Sahil, M. Gandhi, and Carlo Zaniolo. 2018. ASTRO: A Datalog System for Advanced Stream Reasoning. CIKM (2018). Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Marianne Shaw, Paraschos Koutris, Bill Howe, and Dan Suciu. 2012. Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop. In Datalog in Academia and Industry. 165--176. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. 2016. Big Data Analytics with Datalog Queries on Spark. In SIGMOD 2016. 1135--1149. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. S. Sudarshan and Raghu Ramakrishnan. 1991. Aggregation and Relevance in Deductive Databases. In PVLDB. 501--511. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Ashish Thusoo, Joydeep Sen Sarma, et al. 2010. Hive - a petabyte scale data warehouse using Hadoop. In ICDE 2010. 996--1005.Google ScholarGoogle ScholarCross RefCross Ref
  54. Oskar van Rest, Sungpack Hong, et al. 2016. PGQL: a property graph query language. In GRADES Workshop. 7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Jingjing Wang, Tobin Baker, et al. 2017. The Myria Big Data Management and Analytics System and Cloud Services. In CIDR.Google ScholarGoogle Scholar
  56. Jingjing Wang, Magdalena Balazinska, and Daniel Halperin. 2015. Asynchronous and Fault-Tolerant Recursive Datalog Evaluation in Shared-Nothing Engines. PVLDB 8, 12 (2015), 1542--1553. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Markus Weimer, Tyson Condie, and Raghu Ramakrishnan. 2011. Machine learning in ScalOps, a higher order cloud computing language. In BigLearn.Google ScholarGoogle Scholar
  58. Mohan Yang and Carlo Zaniolo. 2014. Main memory evaluation of recursive queries on multicore machines. In IEEE Big Data. 251--260.Google ScholarGoogle Scholar
  59. Yuan Yu, Michael Isard, et al. 2008. DryadLINQ: A System for General- Purpose Distributed Data-Parallel Computing Using a High-Level Language. In OSDI. 1--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing withWorking Sets. In HotCloud. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, and Roberto Zicari. 1997. Advanced Database Systems. Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Carlo Zaniolo, Mohan Yang, et al. 2017. Fixpoint semantics and optimization of recursive Datalog programs with aggregates. TPLP 17, 5--6 (2017), 1048--1065.Google ScholarGoogle ScholarCross RefCross Ref
  63. Carlo Zaniolo, Mohan Yang, et al. 2018. Declarative BigData Algorithms via Aggregates and Relational Database Dependencies. In CEUR Workshop Proceedings 2018.Google ScholarGoogle Scholar
  64. Yanfeng Zhang, Qixin Gao, Lixin Gao, and CuirongWang. 2011. PrIter: A Distributed Framework for Prioritized Iterative Computations. In SOCC. Article 13, 13:1--13:14 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel Databases Meet MapReduce. The VLDB Journal 21, 5 (Oct. 2012), 611--636. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Xin Zhou, Fusheng Wang, and Carlo Zaniolo. 2006. Efficient temporal coalescing query support in relational database systems. In DEXA. Springer, 676--686. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
        June 2019
        2106 pages
        ISBN:9781450356435
        DOI:10.1145/3299869

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 25 June 2019

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        SIGMOD '19 Paper Acceptance Rate88of430submissions,20%Overall Acceptance Rate785of4,003submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader