research-article

RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark

Authors:
Jiaqi Gu

University of California, Los Angeles, Los Angeles, CA, USA

University of California, Los Angeles, Los Angeles, CA, USA
View Profile

,
Yugo H. Watanabe

University of California, Los Angeles, Los Angeles, CA, USA

University of California, Los Angeles, Los Angeles, CA, USA
View Profile

,
William A. Mazza

Univ. of Naples Federico II, Naples, Italy

Univ. of Naples Federico II, Naples, Italy
View Profile

,
Alexander Shkapsky

Workday, Inc., Pleasanton, CA, USA

Workday, Inc., Pleasanton, CA, USA
View Profile

,
Mohan Yang

Google LLC, Mountain View, CA, USA

Google LLC, Mountain View, CA, USA
View Profile

,
Ling Ding

University of California, Los Angeles, Los Angeles, CA, USA

University of California, Los Angeles, Los Angeles, CA, USA
View Profile

,
Carlo Zaniolo

University of California, Los Angeles, Los Angeles, CA, USA

University of California, Los Angeles, Los Angeles, CA, USA
View Profile

SIGMOD '19: Proceedings of the 2019 International Conference on Management of DataJune 2019Pages 467–484https://doi.org/10.1145/3299869.3324959

Published:25 June 2019Publication History

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

Pages 467–484

ABSTRACT

Thanks to a simple SQL extension, Recursive-aggregate-SQL (RaSQL) can express very powerful queries and declarative algorithms, such as classical graph algorithms and data mining algorithms. A novel compiler implementation allows RaSQL to map declarative queries into one basic fixpoint operator supporting aggregates in recursive queries. A fully optimized implementation of this fixpoint operator leads to superior performance, scalability and portability. Thus, our RaSQL system, which extends Spark SQL with the before-mentioned new constructs and implementation techniques, matches and often surpasses the performance of other systems, including Apache Giraph, GraphX and Myria.

References

{n. d.}. Arabic-2005. http://law.di.unimi.it/webdata/arabic-2005/.Google Scholar
{n. d.}. Cypher for Apache Spark. https://github.com/opencypher/ cypher-for-apache-spark.Google Scholar
{n. d.}. Floyd-Warshall Algorithm. https://en.wikipedia.org/wiki/ Floyd_Warshall_algorithm.Google Scholar
{n. d.}. GTgraph. http://www.cse.psu.edu/~kxm85/ software/GTgraph.Google Scholar
{n. d.}. LiveJournal social network. http://snap.stanford.edu/data/ com-LiveJournal.html.Google Scholar
{n. d.}. MATCH (SQL Graph). https://docs.microsoft.com/en-us/sql/ t-sql/queries/match-sql-graph.Google Scholar
{n. d.}. Multi level Marketing Model. https://en.wikipedia.org/wiki/ Multilevelmarketing.Google Scholar
{n. d.}. Orkut network. http://snap.stanford.edu/data/com-Orkut.html.Google Scholar
{n. d.}. Presto. http://prestodb.io.Google Scholar
{n. d.}. Recursion Example: Bill Of Materials. https://www.ibm.com/ support/knowledgecenter/en/SS6NHC/com.ibm.swg.im.dashdb.sql. ref.doc/doc/r0059242.html.Google Scholar
{n. d.}. The RaSQL language. http://rasql.org.Google Scholar
Foto N. Afrati, Vinayak R. Borkar, et al. 2011. Map-reduce extensions and recursive queries. In EDBT. 1--8. Google ScholarDigital Library
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, et al. 2015. Spark SQL: Relational Data Processing in Spark. In SIGMOD. 1383--1394.Google Scholar
David F. Bacon, Nathan Bales, et al. 2017. Spanner: Becoming a SQL System. In SIGMOD 2017. 331--343. Google ScholarDigital Library
Francois Bancilhon. 1986. Naive Evaluation of Recursively Defined Relations. In KBMS. Springer-Verlag, 165--178. Google ScholarDigital Library
Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR abs/1508.03619 (2015). arXiv:1508.03619 http://arxiv.org/abs/1508.03619Google Scholar
Vinayak R. Borkar, Michael J. Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE. 1151--1162. Google ScholarDigital Library
Oscar Boykin, Sam Ritchie, Ian O'Connell, and Jimmy Lin. 2014. Summingbird: A framework for integrating batch and online mapreduce computations. PVLDB 7, 13 (2014), 1441--1451. Google ScholarDigital Library
Yingyi Bu, Vinayak R. Borkar, Jianfeng Jia, Michael J. Carey, and Tyson Condie. 2014. Pregelix: Big(ger) Graph Analytics on a Dataflow Engine. PVLDB 8, 2 (2014), 161--172. Google ScholarDigital Library
Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. 2015. One Trillion Edges: Graph Processing at Facebook-Scale. PVLDB 8, 12 (2015), 1804--1815. Google ScholarDigital Library
Tyson Condie, Ariyam Das, Matteo Interlandi, Alexander Shkapsky, Mohan Yang, and Carlo Zaniolo. 2018. Scaling-up reasoning and advanced analytics on BigData. TPLP 18, 5--6 (2018), 806--845.Google Scholar
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. Google ScholarDigital Library
Nadime Francis, Alastair Green, et al. 2018. Cypher: An Evolving Query Language for Property Graphs. In SIGMOD. 1433--1445.Google ScholarDigital Library
Filippo Furfaro, Sergio Greco, Sumit Ganguly, and Carlo Zaniolo. 2002. Pushing Extrema Aggregates to Optimize Logic Queries. Inf. Syst. 27, 5 (July 2002), 321--343. Google ScholarDigital Library
Sumit Ganguly, Sergio Greco, and Carlo Zaniolo. 1991. Minimum and Maximum Predicates in Logic Programming. In PODS. 154--163. Google ScholarDigital Library
Sumit Ganguly, Sergio Greco, and Carlo Zaniolo. 1995. Extrema Predicates in Deductive Databases. JCS 51, 2 (1995), 244--259. Google ScholarDigital Library
Allen Gelder. 1993. Foundations of aggregation in deductive databases. In Deductive and Object-Oriented Databases. Springer, 13--34.Google Scholar
Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In USENIX, OSDI. 17--30.Google Scholar
Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework. In OSDI. 599--613.Google Scholar
Goetz Graefe and William McKenna. 1991. The Volcano optimizer generator. Technical Report. Colorado Univ at Boulder.Google Scholar
Huahai He and Ambuj K. Singh. 2008. Graphs-at-a-time: query language and access methods for graph databases. In SIGMOD. 405--418. Google ScholarDigital Library
David B. Kemp and Peter J. Stuckey. 1991. Semantics of Logic Programs with Aggregates. In ISLP. 387--401.Google Scholar
Jay Kreps, Neha Narkhede, Jun Rao, et al. 2011. Kafka: A distributed messaging system for log processing.Google Scholar
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue B. Moon. 2010. What is Twitter, a social network or a news media?. In WWW, Raleigh, North Carolina, USA, April 26--30, 2010. 591--600.Google Scholar
Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB 5, 8 (2012), 716--727. Google ScholarDigital Library
Yi Lu, James Cheng, Da Yan, and Huanhuan Wu. 2014. Large-Scale Distributed Graph Computing Systems: An Experimental Evaluation. PVLDB 8, 3 (2014), 281--292. Google ScholarDigital Library
Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A System for Large-scale Graph Processing. In SIGMOD. 135--146. Google ScholarDigital Library
Mirjana Mazuran, Edoardo Serra, and Carlo Zaniolo. 2013. Extending the Power of Datalog Recursion. VLDB J. 22, 4 (2013), 471--493. Google ScholarDigital Library
Frank McSherry, Michael Isard, and Derek G Murray. 2015. Scalability! But at what COST?. In 15th Workshop on Hot Topics in Operating Systems (HotOS {XV}). Google ScholarDigital Library
Frank McSherry, Derek Gordon Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential Dataflow. In CIDR.Google Scholar
Svilen R. Mihaylov, Zachary G. Ives, and Sudipto Guha. 2012. REX: Recursive, Delta-based Data-centric Computation. PVLDB 5, 11 (July 2012), 1280--1291.Google ScholarDigital Library
Jack Minker, Dietmar Seipel, and Carlo Zaniolo. 2014. Logic and Databases: A History of Deductive Databases. In Computational Logic. 571--627.Google Scholar
Inderpal Singh Mumick, Hamid Pirahesh, and Raghu Ramakrishnan. 1990. The Magic of Duplicates and Aggregates. In PVLDB. 264--277. Google ScholarDigital Library
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In SOSP. 439--455. Google ScholarDigital Library
Thomas Neumann. 2011. Efficiently compiling efficient query plans for modern hardware. PVLDB 4, 9 (2011), 539--550. Google ScholarDigital Library
Kian Win Ong, Yannis Papakonstantinou, and Romain Vernoux. 2014. The SQL++ Semi-structured Data Model and Query Language. CoRR abs/1405.3631 (2014).Google Scholar
Jorge Pérez, Marcelo Arenas, and Claudio Gutiérrez. 2009. Semantics and complexity of SPARQL. Database Syst. 34, 3 (2009), 16:1--16:45. Google ScholarDigital Library
Kenneth A Ross and Yehoshua Sagiv. 1992. Monotonic aggregation in deductive databases. In PODS. 114--126. Google ScholarDigital Library
Ariyam Das Sahil, M. Gandhi, and Carlo Zaniolo. 2018. ASTRO: A Datalog System for Advanced Stream Reasoning. CIKM (2018). Google ScholarDigital Library
Marianne Shaw, Paraschos Koutris, Bill Howe, and Dan Suciu. 2012. Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop. In Datalog in Academia and Industry. 165--176. Google ScholarDigital Library
Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. 2016. Big Data Analytics with Datalog Queries on Spark. In SIGMOD 2016. 1135--1149. Google ScholarDigital Library
S. Sudarshan and Raghu Ramakrishnan. 1991. Aggregation and Relevance in Deductive Databases. In PVLDB. 501--511. Google ScholarDigital Library
Ashish Thusoo, Joydeep Sen Sarma, et al. 2010. Hive - a petabyte scale data warehouse using Hadoop. In ICDE 2010. 996--1005.Google ScholarCross Ref
Oskar van Rest, Sungpack Hong, et al. 2016. PGQL: a property graph query language. In GRADES Workshop. 7. Google ScholarDigital Library
Jingjing Wang, Tobin Baker, et al. 2017. The Myria Big Data Management and Analytics System and Cloud Services. In CIDR.Google Scholar
Jingjing Wang, Magdalena Balazinska, and Daniel Halperin. 2015. Asynchronous and Fault-Tolerant Recursive Datalog Evaluation in Shared-Nothing Engines. PVLDB 8, 12 (2015), 1542--1553. Google ScholarDigital Library
Markus Weimer, Tyson Condie, and Raghu Ramakrishnan. 2011. Machine learning in ScalOps, a higher order cloud computing language. In BigLearn.Google Scholar
Mohan Yang and Carlo Zaniolo. 2014. Main memory evaluation of recursive queries on multicore machines. In IEEE Big Data. 251--260.Google Scholar
Yuan Yu, Michael Isard, et al. 2008. DryadLINQ: A System for General- Purpose Distributed Data-Parallel Computing Using a High-Level Language. In OSDI. 1--14. Google ScholarDigital Library
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing withWorking Sets. In HotCloud. Google ScholarDigital Library
Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, and Roberto Zicari. 1997. Advanced Database Systems. Morgan Kaufmann. Google ScholarDigital Library
Carlo Zaniolo, Mohan Yang, et al. 2017. Fixpoint semantics and optimization of recursive Datalog programs with aggregates. TPLP 17, 5--6 (2017), 1048--1065.Google ScholarCross Ref
Carlo Zaniolo, Mohan Yang, et al. 2018. Declarative BigData Algorithms via Aggregates and Relational Database Dependencies. In CEUR Workshop Proceedings 2018.Google Scholar
Yanfeng Zhang, Qixin Gao, Lixin Gao, and CuirongWang. 2011. PrIter: A Distributed Framework for Prioritized Iterative Computations. In SOCC. Article 13, 13:1--13:14 pages. Google ScholarDigital Library
Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel Databases Meet MapReduce. The VLDB Journal 21, 5 (Oct. 2012), 611--636. Google ScholarDigital Library
Xin Zhou, Fusheng Wang, and Carlo Zaniolo. 2006. Efficient temporal coalescing query support in relational database systems. In DEXA. Springer, 676--686. Google ScholarDigital Library

Index Terms

RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
    2. Query languages
      1. Relational database query languages

Recommendations

RASQL: A Powerful Language and its System for Big Data Applications
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

There is a growing interest in supporting advanced Big Data applications on distributed data processing platforms. Most of these systems support SQL or its dialect as the query interface due to its portability and declarative nature. However, current ...
Read More
From graphs to tables the design of scalable systems for graph analytics
WWW '14 Companion: Proceedings of the 23rd International Conference on World Wide Web

From social networks to language modeling, the growing scale and importance of graph data has driven the development of new graph-parallel systems. In this talk, I will review the graph-parallel abstraction and describe how it can be used to express ...
Read More
GeoSparkViz: a scalable geospatial data visualization framework in the apache spark ecosystem
SSDBM '18: Proceedings of the 30th International Conference on Scientific and Statistical Database Management

Data Visualization allows users to summarize, analyze and reason about data. A map visualization tool first loads the designated geospatial data, processes the data and then applies the map visualization effect. Guaranteeing detailed and accurate ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
June 2019
2106 pages
ISBN:9781450356435
DOI:10.1145/3299869
General Chairs:
Peter Boncz
CWI & Vrije Universiteit Amsterdam, The Netherlands
,
Stefan Manegold
CWI & Universiteit Leiden, The Netherlands
,
Program Chairs:
Anastasia Ailamaki
EPFL, Switzerland
,
Amol Deshpande
University of Maryland, USA
,
Tim Kraska
MIT, USA
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 25 June 2019
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
aggregates in recursion
declarative algorithms in sql
distributed computation
recursive query
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '19 Paper Acceptance Rate88of430submissions,20%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 21
  Total Citations
  View Citations
- 619
  Total Downloads
- Downloads (Last 12 months)40
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

RASQL: A Powerful Language and its System for Big Data Applications

From graphs to tables the design of scalable systems for graph analytics

GeoSparkViz: a scalable geospatial data visualization framework in the apache spark ecosystem