Abstract
Data lineage describes the relationship between individual input and output data items of a workflow and is an integral ingredient for both traditional (e.g., debugging or auditing) and emergent (e.g., explanations or cleaning) applications. The core, long-standing problem that lineage systems need to address---and the main focus of this paper---is to quickly capture lineage across a workflow in order to speed up future queries over lineage. Current lineage systems, however, either incur high lineage capture overheads, high lineage query processing costs, or both. In response, developers resort to manual implementations of applications that, in principal, can be expressed and optimized in lineage terms. This paper describes Smoke, an in-memory database engine that provides both fast lineage capture and lineage query processing. To do so, Smoke tightly integrates the lineage capture logic into physical database operators; stores lineage in efficient lineage representations; and employs optimizations if future lineage queries are known up-front. Our experiments on microbenchmarks and realistic workloads show that Smoke reduces the lineage capture overhead and lineage query costs by multiple orders of magnitude as compared to state-of-the-art alternatives. On real-world applications, we show that Smoke meets the latency requirements of interactive visualizations (e.g., < 150ms) and outperforms hand-written implementations of data profiling primitives.
- D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, and S. Madden. The design and implementation of modern column-oriented database systems. Foundations and Trends® in Databases, 5(3):197--280, 2013.Google ScholarCross Ref
- P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, pages 1151--1154, 2006. Google ScholarDigital Library
- S. Assadi, S. Khanna, Y. Li, and V. Tannen. Algorithms for provisioning queries and analytics. CoRR, abs/1512.06143, 2015.Google Scholar
- L. Battle, R. Chang, J. Heer, and M. Stonebraker. Position statement: The case for a visualization performance benchmark. In DSIA, 2017.Google ScholarCross Ref
- D. Bhagwat, L. Chiticariu, W. C. Tan, and G. Vijayvargiya. An annotation management system for relational databases. In VLDB, pages 900--911, 2004. Google ScholarDigital Library
- P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316--330, 2001. Google ScholarDigital Library
- S. K. Card, G. G. Robertson, and J. D. Mackinlay. The information visualizer, an information workspace. In CHI, pages 181--186, 1991. Google ScholarDigital Library
- A. Chalamalla, I. F. Ilyas, M. Ouzzani, and P. Papotti. Descriptive and prescriptive data cleaning. In SIGMOD, pages 445--456, 2014. Google ScholarDigital Library
- S. Chaudhuri, P. Ganesan, and S. Sarawagi. Factorizing complex predicates in queries to exploit indexes. In SIGMOD, pages 361--372, 2003. Google ScholarDigital Library
- A. Chen, Y. Wu, A. Haeberlen, B. T. Loo, and W. Zhou. Data provenance at internet scale: architecture, experiences, and the road ahead. In CIDR, 2017.Google Scholar
- J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends® in Databases, 1(4):379--474, 2009. Google ScholarDigital Library
- L. Chiticariu, W. C. Tan, and G. Vijayvargiya. Dbnotes: A post-it system for relational databases based on provenance. In SIGMOD, pages 942--944, 2005. Google ScholarDigital Library
- Z. Chothia, J. Liagouris, F. McSherry, and T. Roscoe. Explaining outputs in modern data analytics. PVLDB, 9(12):1137--1148, 2016. Google ScholarDigital Library
- Crossfilter. http://square.github.io/crossfilter/, 2015.Google Scholar
- Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. TODS, 25(2):179--227, 2000. Google ScholarDigital Library
- D. Deutch, N. Frost, and A. Gilad. Provenance for natural language queries. PVLDB, 10(5):577--588, 2017. Google ScholarDigital Library
- D. Deutch, Z. G. Ives, T. Milo, and V. Tannen. Caravan: Provisioning for what-if analysis. In CIDR, 2013.Google Scholar
- K. Dursun, C. Binnig, U. Cetintemel, and T. Kraska. Revisiting reuse in main memory database systems. In SIGMOD, pages 1275--1289, 2017. Google ScholarDigital Library
- EU GDPR. https://www.eugdpr.org/.Google Scholar
- F. Faerber, A. Kemper, P.-A. Larson, J. Levandoski, T. Neumann, and A. Pavlo. Main memory database systems. Foundations and Trends® in Databases, 8(1-2):1--130, 2017.Google ScholarCross Ref
- Facebook folly. http://bit.ly/fbfolly, 2017.Google Scholar
- B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE, pages 174--185, 2009. Google ScholarDigital Library
- J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1):29--53, 1997. Google ScholarDigital Library
- T. J. Green and V. Tannen. The semiring franework for database provenance. In PODS, pages 93--99, 2017. Google ScholarDigital Library
- D. Haas, S. Krishnan, J. Wang, M. J. Franklin, and E. Wu. Wisteria: Nurturing scalable data cleaning infrastructure. PVLDB, 8(12):2004--2007, 2015. Google ScholarDigital Library
- N. Hanusse, S. Maabout, and R. Tofan. Revisiting the partial data cube materialization. In ADBIS, pages 70--83, 2011. Google ScholarDigital Library
- J. Heer, M. Agrawala, and W. Willett. Generalized selection via interactive query relaxation. In CHI, pages 959--968, 2008. Google ScholarDigital Library
- R. Ikeda. Provenance In Data-Oriented Workflows. PhD thesis, Stanford University, 2012.Google Scholar
- R. Ikeda, H. Park, and J. Widom. Provenance for generalized map and reduce workflows. In CIDR, 2011.Google Scholar
- R. Ikeda and J. Widom. Panda: A system for provenance and data. Data Engineering Bulletin, 33(3):42--49, 2010.Google Scholar
- M. Interlandi, K. Shah, S. D. Tetali, M. A. Gulzar, S. Yoo, M. Kim, T. Millstein, and T. Condie. Titian: Data provenance support in spark. PVLDB, 9(3):216--227, 2015. Google ScholarDigital Library
- R. Johnson, V. Raman, R. Sidle, and G. Swart. Row-wise parallel predicate evaluation. PVLDB, 1(1):622--634, 2008. Google ScholarDigital Library
- U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl. M4: A visualization-oriented time series data aggregation. PVLDB, 7(10):797--808, 2014. Google ScholarDigital Library
- G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In SIGMOD, pages 951--962, 2010. Google ScholarDigital Library
- M. S. Kester, M. Athanassoulis, and S. Idreos. Access path selection in main-memory optimized data systems: Should i scan or should i probe? In SIGMOD, pages 715--730, 2017. Google ScholarDigital Library
- A. Kumar, M. Boehm, and J. Yang. Data management in machine learning: Challenges, techniques, and systems. In SIGMOD, pages 1717--1722, 2017. Google ScholarDigital Library
- L. Lins, J. T. Klosowski, and C. Scheidegger. Nanocubes for real-time exploration of spatiotemporal datasets. EuroVis, 19(12):2456--2465, 2013. Google ScholarDigital Library
- Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis. TVCG, 20(12):2122--2131, 2014.Google ScholarCross Ref
- Z. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of big data. Computer Graphics Forum, 32(3pt4):421--430, 2013. Google ScholarDigital Library
- D. Logothetis, S. De, and K. Yocum. Scalable lineage capture for debugging disc analytics. In SoCC, pages 17:1--17:15, 2013. Google ScholarDigital Library
- R. Miller. Response time in man-computer conversational transactions. In AFIPS, pages 267--277, 1968. Google ScholarDigital Library
- T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, 2011. Google ScholarDigital Library
- T. Neumann and V. Leis. Compiling database queries into machine code. Data Engineering Bulletin, 37(1):3--11, 2014.Google Scholar
- X. Niu, R. Kapoor, B. Glavic, D. Gawlick, Z. H. Liu, V. Krishnaswamy, and V. Radhakrishnan. Provenance-aware query optimization. In ICDE, pages 473--484, 2017.Google ScholarCross Ref
- Airline On-Time Performance. http://stat-computing.org/dataexpo/2009/the-data.html.Google Scholar
- Airline On-Time Statistics and Delay Causes. https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp.Google Scholar
- C. A. Pahins, S. A. Stephens, C. Scheidegger, and J. L. Comba. Hashedcubes: Simple, low memory, real-time visual exploration of big data. TVCG, 23(1):671--680, 2017. Google ScholarDigital Library
- T. Papenbrock, T. Bergmann, M. Finke, J. Zwiener, and F. Naumann. Data profiling with metanome. PVLDB, 8(12):1860--1863, 2015. Google ScholarDigital Library
- Physician Compare National. https://data.medicare.gov/data/physician-compare.Google Scholar
- F. Psallidas and E. Wu. Smoke: Fine-grained lineage at interactive speed. ArXiv e-prints, abs/1801.07237, 2018.Google Scholar
- T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. Holoclean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190--1201, 2017. Google ScholarDigital Library
- A. Rheinländer, U. Leser, and G. Graefe. Optimization of complex dataflows with user-defined functions. CSUR, 50(3):38:1--38:39, 2017. Google ScholarDigital Library
- S. Roy, L. Orr, and D. Suciu. Explaining query answers with explanation-ready databases. PVLDB, 9(4):348--359, 2015. Google ScholarDigital Library
- B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Symposium on Visual Languages, pages 336--343, 1996. Google ScholarDigital Library
- P. Terlecki, F. Xu, M. Shaw, V. Kim, and R. Wesley. On improving user response times in tableau. In SIGMOD, pages 1695--1706, 2015. Google ScholarDigital Library
- S. Thirumuruganathan, L. Berti-Equille, M. Ouzzani, J.-A. Quiane-Ruiz, and N. Tang. Uguide: User-guided discovery of fd-detectable errors. In SIGMOD, pages 1385--1397, 2017. Google ScholarDigital Library
- E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553--564, 2013. Google ScholarDigital Library
- E. Wu, S. Madden, and M. Stonebraker. A demonstration of dbwipes: clean as you query. PVLDB, 5(12):1894--1897, 2012. Google ScholarDigital Library
- E. Wu, S. Madden, and M. Stonebraker. Subzero: a fine-grained lineage system for scientific databases. In ICDE, pages 865--876, 2013. Google ScholarDigital Library
- E. Wu, F. Psallidas, Z. Miao, H. Zhang, L. Rettig, Y. Wu, and T. Sellam. Combining design and performance in a data visualization management system. In CIDR, 2017.Google Scholar
- Z. Zhang, E. R. Sparks, and M. J. Franklin. Diagnosing machine learning pipelines with fine-grained lineage. In HPDC, pages 143--153, 2017. Google ScholarDigital Library
Index Terms
- Smoke: fine-grained lineage at interactive speed
Recommendations
Smoke: fine-grained lineage at interactive speed
Data lineage describes the relationship between individual input and output data items of a workflow and is an integral ingredient for both traditional (e.g., debugging or auditing) and emergent (e.g., explanations or cleaning) applications. The core, ...
Smoke: fine-grained lineage at interactive speed
Data lineage describes the relationship between individual input and output data items of a workflow and is an integral ingredient for both traditional (e.g., debugging or auditing) and emergent (e.g., explanations or cleaning) applications. The core, ...
Demonstration of Smoke: A Deep Breath of Data-Intensive Lineage Applications
SIGMOD '18: Proceedings of the 2018 International Conference on Management of DataData lineage is a fundamental type of information that describes the relationships between input and output data items in a workflow. As such, an immense amount of data-intensive applications with logic over the input-output relationships can be ...
Comments