research-article

Smoke: fine-grained lineage at interactive speed

Authors:
Fotis Psallidas

Columbia University

Columbia University
View Profile

,
Eugene Wu

Columbia University

Columbia University
View Profile

Proceedings of the VLDB Endowment Volume 11 Issue 6pp 719–732https://doi.org/10.14778/3184470.3184475

Published:01 February 2018Publication History

Proceedings of the VLDB Endowment

Abstract

Data lineage describes the relationship between individual input and output data items of a workflow and is an integral ingredient for both traditional (e.g., debugging or auditing) and emergent (e.g., explanations or cleaning) applications. The core, long-standing problem that lineage systems need to address---and the main focus of this paper---is to quickly capture lineage across a workflow in order to speed up future queries over lineage. Current lineage systems, however, either incur high lineage capture overheads, high lineage query processing costs, or both. In response, developers resort to manual implementations of applications that, in principal, can be expressed and optimized in lineage terms. This paper describes Smoke, an in-memory database engine that provides both fast lineage capture and lineage query processing. To do so, Smoke tightly integrates the lineage capture logic into physical database operators; stores lineage in efficient lineage representations; and employs optimizations if future lineage queries are known up-front. Our experiments on microbenchmarks and realistic workloads show that Smoke reduces the lineage capture overhead and lineage query costs by multiple orders of magnitude as compared to state-of-the-art alternatives. On real-world applications, we show that Smoke meets the latency requirements of interactive visualizations (e.g., < 150ms) and outperforms hand-written implementations of data profiling primitives.

References

D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, and S. Madden. The design and implementation of modern column-oriented database systems. Foundations and Trends® in Databases, 5(3):197--280, 2013.Google ScholarCross Ref
P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, pages 1151--1154, 2006. Google ScholarDigital Library
S. Assadi, S. Khanna, Y. Li, and V. Tannen. Algorithms for provisioning queries and analytics. CoRR, abs/1512.06143, 2015.Google Scholar
L. Battle, R. Chang, J. Heer, and M. Stonebraker. Position statement: The case for a visualization performance benchmark. In DSIA, 2017.Google ScholarCross Ref
D. Bhagwat, L. Chiticariu, W. C. Tan, and G. Vijayvargiya. An annotation management system for relational databases. In VLDB, pages 900--911, 2004. Google ScholarDigital Library
P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316--330, 2001. Google ScholarDigital Library
S. K. Card, G. G. Robertson, and J. D. Mackinlay. The information visualizer, an information workspace. In CHI, pages 181--186, 1991. Google ScholarDigital Library
A. Chalamalla, I. F. Ilyas, M. Ouzzani, and P. Papotti. Descriptive and prescriptive data cleaning. In SIGMOD, pages 445--456, 2014. Google ScholarDigital Library
S. Chaudhuri, P. Ganesan, and S. Sarawagi. Factorizing complex predicates in queries to exploit indexes. In SIGMOD, pages 361--372, 2003. Google ScholarDigital Library
A. Chen, Y. Wu, A. Haeberlen, B. T. Loo, and W. Zhou. Data provenance at internet scale: architecture, experiences, and the road ahead. In CIDR, 2017.Google Scholar
J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends® in Databases, 1(4):379--474, 2009. Google ScholarDigital Library
L. Chiticariu, W. C. Tan, and G. Vijayvargiya. Dbnotes: A post-it system for relational databases based on provenance. In SIGMOD, pages 942--944, 2005. Google ScholarDigital Library
Z. Chothia, J. Liagouris, F. McSherry, and T. Roscoe. Explaining outputs in modern data analytics. PVLDB, 9(12):1137--1148, 2016. Google ScholarDigital Library
Crossfilter. http://square.github.io/crossfilter/, 2015.Google Scholar
Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. TODS, 25(2):179--227, 2000. Google ScholarDigital Library
D. Deutch, N. Frost, and A. Gilad. Provenance for natural language queries. PVLDB, 10(5):577--588, 2017. Google ScholarDigital Library
D. Deutch, Z. G. Ives, T. Milo, and V. Tannen. Caravan: Provisioning for what-if analysis. In CIDR, 2013.Google Scholar
K. Dursun, C. Binnig, U. Cetintemel, and T. Kraska. Revisiting reuse in main memory database systems. In SIGMOD, pages 1275--1289, 2017. Google ScholarDigital Library
EU GDPR. https://www.eugdpr.org/.Google Scholar
F. Faerber, A. Kemper, P.-A. Larson, J. Levandoski, T. Neumann, and A. Pavlo. Main memory database systems. Foundations and Trends® in Databases, 8(1-2):1--130, 2017.Google ScholarCross Ref
Facebook folly. http://bit.ly/fbfolly, 2017.Google Scholar
B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE, pages 174--185, 2009. Google ScholarDigital Library
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1):29--53, 1997. Google ScholarDigital Library
T. J. Green and V. Tannen. The semiring franework for database provenance. In PODS, pages 93--99, 2017. Google ScholarDigital Library
D. Haas, S. Krishnan, J. Wang, M. J. Franklin, and E. Wu. Wisteria: Nurturing scalable data cleaning infrastructure. PVLDB, 8(12):2004--2007, 2015. Google ScholarDigital Library
N. Hanusse, S. Maabout, and R. Tofan. Revisiting the partial data cube materialization. In ADBIS, pages 70--83, 2011. Google ScholarDigital Library
J. Heer, M. Agrawala, and W. Willett. Generalized selection via interactive query relaxation. In CHI, pages 959--968, 2008. Google ScholarDigital Library
R. Ikeda. Provenance In Data-Oriented Workflows. PhD thesis, Stanford University, 2012.Google Scholar
R. Ikeda, H. Park, and J. Widom. Provenance for generalized map and reduce workflows. In CIDR, 2011.Google Scholar
R. Ikeda and J. Widom. Panda: A system for provenance and data. Data Engineering Bulletin, 33(3):42--49, 2010.Google Scholar
M. Interlandi, K. Shah, S. D. Tetali, M. A. Gulzar, S. Yoo, M. Kim, T. Millstein, and T. Condie. Titian: Data provenance support in spark. PVLDB, 9(3):216--227, 2015. Google ScholarDigital Library
R. Johnson, V. Raman, R. Sidle, and G. Swart. Row-wise parallel predicate evaluation. PVLDB, 1(1):622--634, 2008. Google ScholarDigital Library
U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl. M4: A visualization-oriented time series data aggregation. PVLDB, 7(10):797--808, 2014. Google ScholarDigital Library
G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In SIGMOD, pages 951--962, 2010. Google ScholarDigital Library
M. S. Kester, M. Athanassoulis, and S. Idreos. Access path selection in main-memory optimized data systems: Should i scan or should i probe? In SIGMOD, pages 715--730, 2017. Google ScholarDigital Library
A. Kumar, M. Boehm, and J. Yang. Data management in machine learning: Challenges, techniques, and systems. In SIGMOD, pages 1717--1722, 2017. Google ScholarDigital Library
L. Lins, J. T. Klosowski, and C. Scheidegger. Nanocubes for real-time exploration of spatiotemporal datasets. EuroVis, 19(12):2456--2465, 2013. Google ScholarDigital Library
Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis. TVCG, 20(12):2122--2131, 2014.Google ScholarCross Ref
Z. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of big data. Computer Graphics Forum, 32(3pt4):421--430, 2013. Google ScholarDigital Library
D. Logothetis, S. De, and K. Yocum. Scalable lineage capture for debugging disc analytics. In SoCC, pages 17:1--17:15, 2013. Google ScholarDigital Library
R. Miller. Response time in man-computer conversational transactions. In AFIPS, pages 267--277, 1968. Google ScholarDigital Library
T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, 2011. Google ScholarDigital Library
T. Neumann and V. Leis. Compiling database queries into machine code. Data Engineering Bulletin, 37(1):3--11, 2014.Google Scholar
X. Niu, R. Kapoor, B. Glavic, D. Gawlick, Z. H. Liu, V. Krishnaswamy, and V. Radhakrishnan. Provenance-aware query optimization. In ICDE, pages 473--484, 2017.Google ScholarCross Ref
Airline On-Time Performance. http://stat-computing.org/dataexpo/2009/the-data.html.Google Scholar
Airline On-Time Statistics and Delay Causes. https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp.Google Scholar
C. A. Pahins, S. A. Stephens, C. Scheidegger, and J. L. Comba. Hashedcubes: Simple, low memory, real-time visual exploration of big data. TVCG, 23(1):671--680, 2017. Google ScholarDigital Library
T. Papenbrock, T. Bergmann, M. Finke, J. Zwiener, and F. Naumann. Data profiling with metanome. PVLDB, 8(12):1860--1863, 2015. Google ScholarDigital Library
Physician Compare National. https://data.medicare.gov/data/physician-compare.Google Scholar
F. Psallidas and E. Wu. Smoke: Fine-grained lineage at interactive speed. ArXiv e-prints, abs/1801.07237, 2018.Google Scholar
T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. Holoclean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190--1201, 2017. Google ScholarDigital Library
A. Rheinländer, U. Leser, and G. Graefe. Optimization of complex dataflows with user-defined functions. CSUR, 50(3):38:1--38:39, 2017. Google ScholarDigital Library
S. Roy, L. Orr, and D. Suciu. Explaining query answers with explanation-ready databases. PVLDB, 9(4):348--359, 2015. Google ScholarDigital Library
B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Symposium on Visual Languages, pages 336--343, 1996. Google ScholarDigital Library
P. Terlecki, F. Xu, M. Shaw, V. Kim, and R. Wesley. On improving user response times in tableau. In SIGMOD, pages 1695--1706, 2015. Google ScholarDigital Library
S. Thirumuruganathan, L. Berti-Equille, M. Ouzzani, J.-A. Quiane-Ruiz, and N. Tang. Uguide: User-guided discovery of fd-detectable errors. In SIGMOD, pages 1385--1397, 2017. Google ScholarDigital Library
E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553--564, 2013. Google ScholarDigital Library
E. Wu, S. Madden, and M. Stonebraker. A demonstration of dbwipes: clean as you query. PVLDB, 5(12):1894--1897, 2012. Google ScholarDigital Library
E. Wu, S. Madden, and M. Stonebraker. Subzero: a fine-grained lineage system for scientific databases. In ICDE, pages 865--876, 2013. Google ScholarDigital Library
E. Wu, F. Psallidas, Z. Miao, H. Zhang, L. Rettig, Y. Wu, and T. Sellam. Combining design and performance in a data visualization management system. In CIDR, 2017.Google Scholar
Z. Zhang, E. R. Sparks, and M. J. Franklin. Diagnosing machine learning pipelines with fine-grained lineage. In HPDC, pages 143--153, 2017. Google ScholarDigital Library

Index Terms

Smoke: fine-grained lineage at interactive speed
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Recommendations

Smoke: fine-grained lineage at interactive speed

Data lineage describes the relationship between individual input and output data items of a workflow and is an integral ingredient for both traditional (e.g., debugging or auditing) and emergent (e.g., explanations or cleaning) applications. The core, ...
Read More
Smoke: fine-grained lineage at interactive speed

Data lineage describes the relationship between individual input and output data items of a workflow and is an integral ingredient for both traditional (e.g., debugging or auditing) and emergent (e.g., explanations or cleaning) applications. The core, ...
Read More
Demonstration of Smoke: A Deep Breath of Data-Intensive Lineage Applications
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Data lineage is a fundamental type of information that describes the relationships between input and output data items in a workflow. As such, an immense amount of data-intensive applications with logic over the input-output relationships can be ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 11, Issue 6
February 2018
210 pages
ISSN:2150-8097
Editors:
Jian Pei
Simon Fraser University
,
Sihem Amer-Yahia
University of Grenoble Alpes, CNRS
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 February 2018
Published in pvldb Volume 11, Issue 6
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 6
  Total Citations
  View Citations
- 39
  Total Downloads
- Downloads (Last 12 months)5
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Smoke: fine-grained lineage at interactive speed

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Smoke: fine-grained lineage at interactive speed

Smoke: fine-grained lineage at interactive speed

Demonstration of Smoke: A Deep Breath of Data-Intensive Lineage Applications

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Smoke: fine-grained lineage at interactive speed

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Smoke: fine-grained lineage at interactive speed

Smoke: fine-grained lineage at interactive speed

Demonstration of Smoke: A Deep Breath of Data-Intensive Lineage Applications

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media