skip to main content
research-article

Smoke: fine-grained lineage at interactive speed

Published:01 February 2018Publication History
Skip Abstract Section

Abstract

Data lineage describes the relationship between individual input and output data items of a workflow and is an integral ingredient for both traditional (e.g., debugging or auditing) and emergent (e.g., explanations or cleaning) applications. The core, long-standing problem that lineage systems need to address---and the main focus of this paper---is to quickly capture lineage across a workflow in order to speed up future queries over lineage. Current lineage systems, however, either incur high lineage capture overheads, high lineage query processing costs, or both. In response, developers resort to manual implementations of applications that, in principal, can be expressed and optimized in lineage terms. This paper describes Smoke, an in-memory database engine that provides both fast lineage capture and lineage query processing. To do so, Smoke tightly integrates the lineage capture logic into physical database operators; stores lineage in efficient lineage representations; and employs optimizations if future lineage queries are known up-front. Our experiments on microbenchmarks and realistic workloads show that Smoke reduces the lineage capture overhead and lineage query costs by multiple orders of magnitude as compared to state-of-the-art alternatives. On real-world applications, we show that Smoke meets the latency requirements of interactive visualizations (e.g., < 150ms) and outperforms hand-written implementations of data profiling primitives.

References

  1. D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, and S. Madden. The design and implementation of modern column-oriented database systems. Foundations and Trends® in Databases, 5(3):197--280, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  2. P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, pages 1151--1154, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Assadi, S. Khanna, Y. Li, and V. Tannen. Algorithms for provisioning queries and analytics. CoRR, abs/1512.06143, 2015.Google ScholarGoogle Scholar
  4. L. Battle, R. Chang, J. Heer, and M. Stonebraker. Position statement: The case for a visualization performance benchmark. In DSIA, 2017.Google ScholarGoogle ScholarCross RefCross Ref
  5. D. Bhagwat, L. Chiticariu, W. C. Tan, and G. Vijayvargiya. An annotation management system for relational databases. In VLDB, pages 900--911, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316--330, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. K. Card, G. G. Robertson, and J. D. Mackinlay. The information visualizer, an information workspace. In CHI, pages 181--186, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Chalamalla, I. F. Ilyas, M. Ouzzani, and P. Papotti. Descriptive and prescriptive data cleaning. In SIGMOD, pages 445--456, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Chaudhuri, P. Ganesan, and S. Sarawagi. Factorizing complex predicates in queries to exploit indexes. In SIGMOD, pages 361--372, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. A. Chen, Y. Wu, A. Haeberlen, B. T. Loo, and W. Zhou. Data provenance at internet scale: architecture, experiences, and the road ahead. In CIDR, 2017.Google ScholarGoogle Scholar
  11. J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends® in Databases, 1(4):379--474, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. L. Chiticariu, W. C. Tan, and G. Vijayvargiya. Dbnotes: A post-it system for relational databases based on provenance. In SIGMOD, pages 942--944, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Z. Chothia, J. Liagouris, F. McSherry, and T. Roscoe. Explaining outputs in modern data analytics. PVLDB, 9(12):1137--1148, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Crossfilter. http://square.github.io/crossfilter/, 2015.Google ScholarGoogle Scholar
  15. Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. TODS, 25(2):179--227, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D. Deutch, N. Frost, and A. Gilad. Provenance for natural language queries. PVLDB, 10(5):577--588, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. D. Deutch, Z. G. Ives, T. Milo, and V. Tannen. Caravan: Provisioning for what-if analysis. In CIDR, 2013.Google ScholarGoogle Scholar
  18. K. Dursun, C. Binnig, U. Cetintemel, and T. Kraska. Revisiting reuse in main memory database systems. In SIGMOD, pages 1275--1289, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. EU GDPR. https://www.eugdpr.org/.Google ScholarGoogle Scholar
  20. F. Faerber, A. Kemper, P.-A. Larson, J. Levandoski, T. Neumann, and A. Pavlo. Main memory database systems. Foundations and Trends® in Databases, 8(1-2):1--130, 2017.Google ScholarGoogle ScholarCross RefCross Ref
  21. Facebook folly. http://bit.ly/fbfolly, 2017.Google ScholarGoogle Scholar
  22. B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE, pages 174--185, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1):29--53, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. T. J. Green and V. Tannen. The semiring franework for database provenance. In PODS, pages 93--99, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. D. Haas, S. Krishnan, J. Wang, M. J. Franklin, and E. Wu. Wisteria: Nurturing scalable data cleaning infrastructure. PVLDB, 8(12):2004--2007, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. N. Hanusse, S. Maabout, and R. Tofan. Revisiting the partial data cube materialization. In ADBIS, pages 70--83, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J. Heer, M. Agrawala, and W. Willett. Generalized selection via interactive query relaxation. In CHI, pages 959--968, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. R. Ikeda. Provenance In Data-Oriented Workflows. PhD thesis, Stanford University, 2012.Google ScholarGoogle Scholar
  29. R. Ikeda, H. Park, and J. Widom. Provenance for generalized map and reduce workflows. In CIDR, 2011.Google ScholarGoogle Scholar
  30. R. Ikeda and J. Widom. Panda: A system for provenance and data. Data Engineering Bulletin, 33(3):42--49, 2010.Google ScholarGoogle Scholar
  31. M. Interlandi, K. Shah, S. D. Tetali, M. A. Gulzar, S. Yoo, M. Kim, T. Millstein, and T. Condie. Titian: Data provenance support in spark. PVLDB, 9(3):216--227, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. R. Johnson, V. Raman, R. Sidle, and G. Swart. Row-wise parallel predicate evaluation. PVLDB, 1(1):622--634, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl. M4: A visualization-oriented time series data aggregation. PVLDB, 7(10):797--808, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In SIGMOD, pages 951--962, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. M. S. Kester, M. Athanassoulis, and S. Idreos. Access path selection in main-memory optimized data systems: Should i scan or should i probe? In SIGMOD, pages 715--730, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. A. Kumar, M. Boehm, and J. Yang. Data management in machine learning: Challenges, techniques, and systems. In SIGMOD, pages 1717--1722, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. L. Lins, J. T. Klosowski, and C. Scheidegger. Nanocubes for real-time exploration of spatiotemporal datasets. EuroVis, 19(12):2456--2465, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis. TVCG, 20(12):2122--2131, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  39. Z. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of big data. Computer Graphics Forum, 32(3pt4):421--430, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. D. Logothetis, S. De, and K. Yocum. Scalable lineage capture for debugging disc analytics. In SoCC, pages 17:1--17:15, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. R. Miller. Response time in man-computer conversational transactions. In AFIPS, pages 267--277, 1968. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. T. Neumann and V. Leis. Compiling database queries into machine code. Data Engineering Bulletin, 37(1):3--11, 2014.Google ScholarGoogle Scholar
  44. X. Niu, R. Kapoor, B. Glavic, D. Gawlick, Z. H. Liu, V. Krishnaswamy, and V. Radhakrishnan. Provenance-aware query optimization. In ICDE, pages 473--484, 2017.Google ScholarGoogle ScholarCross RefCross Ref
  45. Airline On-Time Performance. http://stat-computing.org/dataexpo/2009/the-data.html.Google ScholarGoogle Scholar
  46. Airline On-Time Statistics and Delay Causes. https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp.Google ScholarGoogle Scholar
  47. C. A. Pahins, S. A. Stephens, C. Scheidegger, and J. L. Comba. Hashedcubes: Simple, low memory, real-time visual exploration of big data. TVCG, 23(1):671--680, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. T. Papenbrock, T. Bergmann, M. Finke, J. Zwiener, and F. Naumann. Data profiling with metanome. PVLDB, 8(12):1860--1863, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Physician Compare National. https://data.medicare.gov/data/physician-compare.Google ScholarGoogle Scholar
  50. F. Psallidas and E. Wu. Smoke: Fine-grained lineage at interactive speed. ArXiv e-prints, abs/1801.07237, 2018.Google ScholarGoogle Scholar
  51. T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. Holoclean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190--1201, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. A. Rheinländer, U. Leser, and G. Graefe. Optimization of complex dataflows with user-defined functions. CSUR, 50(3):38:1--38:39, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. S. Roy, L. Orr, and D. Suciu. Explaining query answers with explanation-ready databases. PVLDB, 9(4):348--359, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Symposium on Visual Languages, pages 336--343, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. P. Terlecki, F. Xu, M. Shaw, V. Kim, and R. Wesley. On improving user response times in tableau. In SIGMOD, pages 1695--1706, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. S. Thirumuruganathan, L. Berti-Equille, M. Ouzzani, J.-A. Quiane-Ruiz, and N. Tang. Uguide: User-guided discovery of fd-detectable errors. In SIGMOD, pages 1385--1397, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553--564, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. E. Wu, S. Madden, and M. Stonebraker. A demonstration of dbwipes: clean as you query. PVLDB, 5(12):1894--1897, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. E. Wu, S. Madden, and M. Stonebraker. Subzero: a fine-grained lineage system for scientific databases. In ICDE, pages 865--876, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. E. Wu, F. Psallidas, Z. Miao, H. Zhang, L. Rettig, Y. Wu, and T. Sellam. Combining design and performance in a data visualization management system. In CIDR, 2017.Google ScholarGoogle Scholar
  61. Z. Zhang, E. R. Sparks, and M. J. Franklin. Diagnosing machine learning pipelines with fine-grained lineage. In HPDC, pages 143--153, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Smoke: fine-grained lineage at interactive speed
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image Proceedings of the VLDB Endowment
        Proceedings of the VLDB Endowment  Volume 11, Issue 6
        February 2018
        210 pages
        ISSN:2150-8097
        Issue’s Table of Contents

        Publisher

        VLDB Endowment

        Publication History

        • Published: 1 February 2018
        Published in pvldb Volume 11, Issue 6

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader