ABSTRACT
Thanks to a simple SQL extension, Recursive-aggregate-SQL (RaSQL) can express very powerful queries and declarative algorithms, such as classical graph algorithms and data mining algorithms. A novel compiler implementation allows RaSQL to map declarative queries into one basic fixpoint operator supporting aggregates in recursive queries. A fully optimized implementation of this fixpoint operator leads to superior performance, scalability and portability. Thus, our RaSQL system, which extends Spark SQL with the before-mentioned new constructs and implementation techniques, matches and often surpasses the performance of other systems, including Apache Giraph, GraphX and Myria.
- {n. d.}. Arabic-2005. http://law.di.unimi.it/webdata/arabic-2005/.Google Scholar
- {n. d.}. Cypher for Apache Spark. https://github.com/opencypher/ cypher-for-apache-spark.Google Scholar
- {n. d.}. Floyd-Warshall Algorithm. https://en.wikipedia.org/wiki/ Floyd_Warshall_algorithm.Google Scholar
- {n. d.}. GTgraph. http://www.cse.psu.edu/~kxm85/ software/GTgraph.Google Scholar
- {n. d.}. LiveJournal social network. http://snap.stanford.edu/data/ com-LiveJournal.html.Google Scholar
- {n. d.}. MATCH (SQL Graph). https://docs.microsoft.com/en-us/sql/ t-sql/queries/match-sql-graph.Google Scholar
- {n. d.}. Multi level Marketing Model. https://en.wikipedia.org/wiki/ Multilevelmarketing.Google Scholar
- {n. d.}. Orkut network. http://snap.stanford.edu/data/com-Orkut.html.Google Scholar
- {n. d.}. Presto. http://prestodb.io.Google Scholar
- {n. d.}. Recursion Example: Bill Of Materials. https://www.ibm.com/ support/knowledgecenter/en/SS6NHC/com.ibm.swg.im.dashdb.sql. ref.doc/doc/r0059242.html.Google Scholar
- {n. d.}. The RaSQL language. http://rasql.org.Google Scholar
- Foto N. Afrati, Vinayak R. Borkar, et al. 2011. Map-reduce extensions and recursive queries. In EDBT. 1--8. Google ScholarDigital Library
- Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, et al. 2015. Spark SQL: Relational Data Processing in Spark. In SIGMOD. 1383--1394.Google Scholar
- David F. Bacon, Nathan Bales, et al. 2017. Spanner: Becoming a SQL System. In SIGMOD 2017. 331--343. Google ScholarDigital Library
- Francois Bancilhon. 1986. Naive Evaluation of Recursively Defined Relations. In KBMS. Springer-Verlag, 165--178. Google ScholarDigital Library
- Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619 (2015). arXiv:1508.03619 http://arxiv.org/abs/1508.03619Google Scholar
- Vinayak R. Borkar, Michael J. Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE. 1151--1162. Google ScholarDigital Library
- Oscar Boykin, Sam Ritchie, Ian O'Connell, and Jimmy Lin. 2014. Summingbird: A framework for integrating batch and online mapreduce computations. PVLDB 7, 13 (2014), 1441--1451. Google ScholarDigital Library
- Yingyi Bu, Vinayak R. Borkar, Jianfeng Jia, Michael J. Carey, and Tyson Condie. 2014. Pregelix: Big(ger) Graph Analytics on a Dataflow Engine. PVLDB 8, 2 (2014), 161--172. Google ScholarDigital Library
- Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. 2015. One Trillion Edges: Graph Processing at Facebook-Scale. PVLDB 8, 12 (2015), 1804--1815. Google ScholarDigital Library
- Tyson Condie, Ariyam Das, Matteo Interlandi, Alexander Shkapsky, Mohan Yang, and Carlo Zaniolo. 2018. Scaling-up reasoning and advanced analytics on BigData. TPLP 18, 5--6 (2018), 806--845.Google Scholar
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. Google ScholarDigital Library
- Nadime Francis, Alastair Green, et al. 2018. Cypher: An Evolving Query Language for Property Graphs. In SIGMOD. 1433--1445.Google ScholarDigital Library
- Filippo Furfaro, Sergio Greco, Sumit Ganguly, and Carlo Zaniolo. 2002. Pushing Extrema Aggregates to Optimize Logic Queries. Inf. Syst. 27, 5 (July 2002), 321--343. Google ScholarDigital Library
- Sumit Ganguly, Sergio Greco, and Carlo Zaniolo. 1991. Minimum and Maximum Predicates in Logic Programming. In PODS. 154--163. Google ScholarDigital Library
- Sumit Ganguly, Sergio Greco, and Carlo Zaniolo. 1995. Extrema Predicates in Deductive Databases. JCS 51, 2 (1995), 244--259. Google ScholarDigital Library
- Allen Gelder. 1993. Foundations of aggregation in deductive databases. In Deductive and Object-Oriented Databases. Springer, 13--34.Google Scholar
- Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In USENIX, OSDI. 17--30.Google Scholar
- Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework. In OSDI. 599--613.Google Scholar
- Goetz Graefe and William McKenna. 1991. The Volcano optimizer generator. Technical Report. Colorado Univ at Boulder.Google Scholar
- Huahai He and Ambuj K. Singh. 2008. Graphs-at-a-time: query language and access methods for graph databases. In SIGMOD. 405--418. Google ScholarDigital Library
- David B. Kemp and Peter J. Stuckey. 1991. Semantics of Logic Programs with Aggregates. In ISLP. 387--401.Google Scholar
- Jay Kreps, Neha Narkhede, Jun Rao, et al. 2011. Kafka: A distributed messaging system for log processing.Google Scholar
- Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue B. Moon. 2010. What is Twitter, a social network or a news media?. In WWW, Raleigh, North Carolina, USA, April 26--30, 2010. 591--600.Google Scholar
- Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB 5, 8 (2012), 716--727. Google ScholarDigital Library
- Yi Lu, James Cheng, Da Yan, and Huanhuan Wu. 2014. Large-Scale Distributed Graph Computing Systems: An Experimental Evaluation. PVLDB 8, 3 (2014), 281--292. Google ScholarDigital Library
- Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A System for Large-scale Graph Processing. In SIGMOD. 135--146. Google ScholarDigital Library
- Mirjana Mazuran, Edoardo Serra, and Carlo Zaniolo. 2013. Extending the Power of Datalog Recursion. VLDB J. 22, 4 (2013), 471--493. Google ScholarDigital Library
- Frank McSherry, Michael Isard, and Derek G Murray. 2015. Scalability! But at what COST?. In 15th Workshop on Hot Topics in Operating Systems (HotOS {XV}). Google ScholarDigital Library
- Frank McSherry, Derek Gordon Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential Dataflow. In CIDR.Google Scholar
- Svilen R. Mihaylov, Zachary G. Ives, and Sudipto Guha. 2012. REX: Recursive, Delta-based Data-centric Computation. PVLDB 5, 11 (July 2012), 1280--1291.Google ScholarDigital Library
- Jack Minker, Dietmar Seipel, and Carlo Zaniolo. 2014. Logic and Databases: A History of Deductive Databases. In Computational Logic. 571--627.Google Scholar
- Inderpal Singh Mumick, Hamid Pirahesh, and Raghu Ramakrishnan. 1990. The Magic of Duplicates and Aggregates. In PVLDB. 264--277. Google ScholarDigital Library
- Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In SOSP. 439--455. Google ScholarDigital Library
- Thomas Neumann. 2011. Efficiently compiling efficient query plans for modern hardware. PVLDB 4, 9 (2011), 539--550. Google ScholarDigital Library
- Kian Win Ong, Yannis Papakonstantinou, and Romain Vernoux. 2014. The SQL++ Semi-structured Data Model and Query Language. CoRR abs/1405.3631 (2014).Google Scholar
- Jorge Pérez, Marcelo Arenas, and Claudio Gutiérrez. 2009. Semantics and complexity of SPARQL. Database Syst. 34, 3 (2009), 16:1--16:45. Google ScholarDigital Library
- Kenneth A Ross and Yehoshua Sagiv. 1992. Monotonic aggregation in deductive databases. In PODS. 114--126. Google ScholarDigital Library
- Ariyam Das Sahil, M. Gandhi, and Carlo Zaniolo. 2018. ASTRO: A Datalog System for Advanced Stream Reasoning. CIKM (2018). Google ScholarDigital Library
- Marianne Shaw, Paraschos Koutris, Bill Howe, and Dan Suciu. 2012. Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop. In Datalog in Academia and Industry. 165--176. Google ScholarDigital Library
- Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. 2016. Big Data Analytics with Datalog Queries on Spark. In SIGMOD 2016. 1135--1149. Google ScholarDigital Library
- S. Sudarshan and Raghu Ramakrishnan. 1991. Aggregation and Relevance in Deductive Databases. In PVLDB. 501--511. Google ScholarDigital Library
- Ashish Thusoo, Joydeep Sen Sarma, et al. 2010. Hive - a petabyte scale data warehouse using Hadoop. In ICDE 2010. 996--1005.Google ScholarCross Ref
- Oskar van Rest, Sungpack Hong, et al. 2016. PGQL: a property graph query language. In GRADES Workshop. 7. Google ScholarDigital Library
- Jingjing Wang, Tobin Baker, et al. 2017. The Myria Big Data Management and Analytics System and Cloud Services. In CIDR.Google Scholar
- Jingjing Wang, Magdalena Balazinska, and Daniel Halperin. 2015. Asynchronous and Fault-Tolerant Recursive Datalog Evaluation in Shared-Nothing Engines. PVLDB 8, 12 (2015), 1542--1553. Google ScholarDigital Library
- Markus Weimer, Tyson Condie, and Raghu Ramakrishnan. 2011. Machine learning in ScalOps, a higher order cloud computing language. In BigLearn.Google Scholar
- Mohan Yang and Carlo Zaniolo. 2014. Main memory evaluation of recursive queries on multicore machines. In IEEE Big Data. 251--260.Google Scholar
- Yuan Yu, Michael Isard, et al. 2008. DryadLINQ: A System for General- Purpose Distributed Data-Parallel Computing Using a High-Level Language. In OSDI. 1--14. Google ScholarDigital Library
- Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing withWorking Sets. In HotCloud. Google ScholarDigital Library
- Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, and Roberto Zicari. 1997. Advanced Database Systems. Morgan Kaufmann. Google ScholarDigital Library
- Carlo Zaniolo, Mohan Yang, et al. 2017. Fixpoint semantics and optimization of recursive Datalog programs with aggregates. TPLP 17, 5--6 (2017), 1048--1065.Google ScholarCross Ref
- Carlo Zaniolo, Mohan Yang, et al. 2018. Declarative BigData Algorithms via Aggregates and Relational Database Dependencies. In CEUR Workshop Proceedings 2018.Google Scholar
- Yanfeng Zhang, Qixin Gao, Lixin Gao, and CuirongWang. 2011. PrIter: A Distributed Framework for Prioritized Iterative Computations. In SOCC. Article 13, 13:1--13:14 pages. Google ScholarDigital Library
- Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel Databases Meet MapReduce. The VLDB Journal 21, 5 (Oct. 2012), 611--636. Google ScholarDigital Library
- Xin Zhou, Fusheng Wang, and Carlo Zaniolo. 2006. Efficient temporal coalescing query support in relational database systems. In DEXA. Springer, 676--686. Google ScholarDigital Library
Index Terms
- RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark
Recommendations
RASQL: A Powerful Language and its System for Big Data Applications
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of DataThere is a growing interest in supporting advanced Big Data applications on distributed data processing platforms. Most of these systems support SQL or its dialect as the query interface due to its portability and declarative nature. However, current ...
From graphs to tables the design of scalable systems for graph analytics
WWW '14 Companion: Proceedings of the 23rd International Conference on World Wide WebFrom social networks to language modeling, the growing scale and importance of graph data has driven the development of new graph-parallel systems. In this talk, I will review the graph-parallel abstraction and describe how it can be used to express ...
GeoSparkViz: a scalable geospatial data visualization framework in the apache spark ecosystem
SSDBM '18: Proceedings of the 30th International Conference on Scientific and Statistical Database ManagementData Visualization allows users to summarize, analyze and reason about data. A map visualization tool first loads the designated geospatial data, processes the data and then applies the map visualization effect. Guaranteeing detailed and accurate ...
Comments