ABSTRACT
Big data analytical systems, such as MapReduce, perform aggressive materialization of intermediate job results in order to support fault tolerance. When jobs correspond to exploratory queries submitted by data analysts, these materializations yield a large set of materialized views that we propose to treat as an opportunistic physical design. We present a semantic model for UDFs that enables effective reuse of views containing UDFs along with a rewrite algorithm that provably finds the minimum-cost rewrite under certain assumptions. An experimental study on real-world datasets using our prototype based on Hive shows that our approach can result in dramatic performance improvements.
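The idea of treating leftover intermediate results as reusable views can be illustrated with a toy sketch. This is not the paper's rewrite algorithm or its UDF model. It assumes, purely for illustration, that a query is a set of commuting operations, that a view is usable when every operation it has already applied is also required by the query, and that cost is the number of rows read times one plus the number of operations still to apply. The `View` class, the operation labels, and the cost formula are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    name: str
    ops: frozenset  # operations already applied to the base data (hypothetical labels)
    rows: int       # number of rows stored in the view


def rewrite_cost(query_ops, view):
    """Toy cost model: read the view once, then one pass per remaining operation."""
    remaining = query_ops - view.ops
    return view.rows * (1 + len(remaining))


def cheapest_rewrite(query_ops, views):
    """Return (view, cost) for the cheapest usable view, or None if none is usable."""
    # A view is usable only if it applied nothing the query does not ask for.
    usable = [v for v in views if v.ops <= query_ops]
    if not usable:
        return None
    best = min(usable, key=lambda v: rewrite_cost(query_ops, v))
    return best, rewrite_cost(query_ops, best)


# Intermediate results left behind by earlier exploratory jobs (made-up data).
views = [
    View("base_logs", frozenset(), 1_000_000),
    View("v_filtered", frozenset({"filter:lang=en"}), 200_000),
    View("v_scored", frozenset({"filter:lang=en", "udf:sentiment"}), 200_000),
    View("v_other", frozenset({"filter:lang=fr"}), 50_000),
]

query = frozenset({"filter:lang=en", "udf:sentiment", "agg:count_by_user"})
best, cost = cheapest_rewrite(query, views)
print(best.name, cost)
```

In this sketch the view that already holds the filtered and UDF-scored rows wins, because only the aggregation remains to be applied. A real system would need a semantic description of each UDF to decide whether a view is usable at all, which is the role the abstract assigns to the UDF model.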
Index Terms
- Opportunistic physical design for big data analytics