30-09-2020 | Special Issue Paper

DIFF: a relational interface for large-scale data explanation

Authors: Firas Abuzaid, Peter Kraft, Sahaana Suri, Edward Gan, Eric Xu, Atul Shenoy, Asvin Ananthanarayan, John Sheu, Erik Meijer, Xi Wu, Jeff Naughton, Peter Bailis, Matei Zaharia

Published in: The VLDB Journal | Issue 1/2021

Abstract

A range of explanation engines assist data analysts by performing feature selection over increasingly high-volume and high-dimensional data, grouping and highlighting commonalities among data points. While useful in diverse tasks such as user behavior analytics, operational event processing, and root-cause analysis, today’s explanation engines are designed as stand-alone data processing tools that do not interoperate with traditional, SQL-based analytics workflows; this limits the applicability and extensibility of these engines. In response, we propose the DIFF operator, a relational aggregation operator that unifies the core functionality of these engines with declarative relational query processing. We implement both single-node and distributed versions of the DIFF operator in MB SQL, an extension of MacroBase, and demonstrate how DIFF can provide the same semantics as existing explanation engines while capturing a broad set of production use cases in industry, including at Microsoft and Facebook. Additionally, we illustrate how this declarative approach to data explanation enables new logical and physical query optimizations. We evaluate these optimizations on several real-world production applications and find that DIFF in MB SQL can outperform state-of-the-art engines by up to an order of magnitude.
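The abstract does not give the semantics or the MB SQL syntax of DIFF, so the following is only a rough sketch of the general idea behind an explanation query: compare a test relation with a control relation and report attribute-value combinations that are over-represented in the test relation. The function name `diff_explanations`, the two metrics (support in the test relation and a simple ratio of match rates), the thresholds, and the sample rows are all assumptions made for illustration. They are not taken from the paper or from MacroBase.

```python
from collections import Counter
from itertools import combinations


def diff_explanations(test, control, attrs,
                      min_support=0.5, min_ratio=2.0, max_order=2):
    """Hypothetical helper: find attribute-value combinations that are
    over-represented in `test` relative to `control`.

    Both relations are non-empty lists of dicts. The metrics and thresholds
    are illustrative assumptions, not the paper's definitions.
    """
    def count_combos(rows):
        # Count every combination of up to `max_order` attribute values.
        counts = Counter()
        for row in rows:
            for order in range(1, max_order + 1):
                for cols in combinations(attrs, order):
                    counts[tuple((c, row[c]) for c in cols)] += 1
        return counts

    test_counts = count_combos(test)
    control_counts = count_combos(control)

    results = []
    # Only combinations that occur in `test` are considered at all.
    for combo, n_test in test_counts.items():
        support = n_test / len(test)
        if support < min_support:
            continue
        control_rate = control_counts.get(combo, 0) / len(control)
        ratio = float("inf") if control_rate == 0 else support / control_rate
        if ratio >= min_ratio:
            results.append((combo, support, ratio))
    # Strongest ratio first, then highest support.
    return sorted(results, key=lambda r: (-r[2], -r[1]))


# Made-up sample data: rows from a failing population and a healthy one.
test_rows = [
    {"device": "A", "version": "v2"},
    {"device": "A", "version": "v2"},
    {"device": "B", "version": "v2"},
    {"device": "A", "version": "v1"},
]
control_rows = [
    {"device": "A", "version": "v1"},
    {"device": "B", "version": "v1"},
    {"device": "B", "version": "v2"},
    {"device": "B", "version": "v1"},
]

results = diff_explanations(test_rows, control_rows, ["device", "version"])
for combo, support, ratio in results:
    print(combo, support, ratio)
```

Expressing this kind of computation as a relational operator, as the abstract proposes, is what would let it compose with ordinary SQL selections and joins; the brute-force enumeration above makes no attempt at the logical and physical optimizations the paper evaluates.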

To keep ANTI DIFF consistent with [two inline formulas, available only as images in the source], we also prune all explanations with no support in R.

Our implementation is open source and available at https://github.com/stanford-futuredata/macrobase.

https://support.censys.io/hc/en-us/articles/360038761891-Research-Access-to-Censys-Data.

https://www.cms.gov/OpenPayments/Explore-the-Data/Data-Overview.html.

https://bitbucket.org/xlwang/dataxray-source-code.

Title: DIFF: a relational interface for large-scale data explanation
Authors: Firas Abuzaid
Peter Kraft
Sahaana Suri
Edward Gan
Eric Xu
Atul Shenoy
Asvin Ananthanarayan
John Sheu
Erik Meijer
Xi Wu
Jeff Naughton
Peter Bailis
Matei Zaharia
Publication date: 30-09-2020
Publisher: Springer Berlin Heidelberg
Published in: The VLDB Journal / Issue 1/2021
Print ISSN: 1066-8888
Electronic ISSN: 0949-877X
DOI: https://doi.org/10.1007/s00778-020-00633-6
