Ananke: a streaming framework for live forward provenance

Published: 01 November 2020

Abstract

Data streaming enables online monitoring of large and continuous event streams in Cyber-Physical Systems (CPSs). In such scenarios, fine-grained backward provenance tools can connect streaming query results to the source data producing them, allowing analysts to study the dependency/causality of CPS events. While CPS monitoring commonly produces many events, backward provenance does not help prioritize event inspection since it does not specify if an event's provenance could still contribute to future results.

To close this gap, we introduce Ananke, a framework that extends any fine-grained backward provenance tool to deliver a live bipartite graph of fine-grained forward provenance. With Ananke, analysts can prioritize the analysis of provenance data based on whether such data is still potentially being processed by the monitoring queries. We prove that our solution is correct, discuss multiple implementations, including one leveraging streaming APIs for parallel analysis, and show that Ananke incurs small overheads, close to those of existing tools for fine-grained backward provenance.
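As a rough illustration of a live bipartite forward-provenance graph, the Python sketch below inverts backward provenance (result to sources) into forward edges (source to results) and lets a caller mark a source as expired once it can no longer contribute to future results. The class name, method names, and the explicit `mark_expired` call are assumptions made for this illustration; they are not Ananke's API, and the abstract does not say how expiry is detected.

```python
from dataclasses import dataclass, field


@dataclass
class ForwardProvenanceGraph:
    """Hypothetical bipartite graph: source tuples on one side, query results on the other."""

    # source id -> set of result ids that the source contributed to
    edges: dict = field(default_factory=dict)
    # source ids assumed unable to contribute to any future result
    expired: set = field(default_factory=set)

    def add_backward_provenance(self, result_id, source_ids):
        """Invert one backward-provenance record (result -> sources) into forward edges."""
        for source_id in source_ids:
            self.edges.setdefault(source_id, set()).add(result_id)

    def mark_expired(self, source_id):
        """Record that a source can no longer gain edges (detection is out of scope here)."""
        self.edges.setdefault(source_id, set())
        self.expired.add(source_id)

    def is_live(self, source_id):
        """A source is live while it may still contribute to future results."""
        return source_id not in self.expired

    def results_of(self, source_id):
        """Forward provenance of a source: the results it has contributed to so far."""
        return set(self.edges.get(source_id, ()))


graph = ForwardProvenanceGraph()
graph.add_backward_provenance("r1", ["s1", "s2"])
graph.add_backward_provenance("r2", ["s2"])
graph.mark_expired("s1")
```

In this sketch an analyst could inspect `s1` first, since its forward provenance is final, while `s2` is still live and may gain further edges.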


Published in

Proceedings of the VLDB Endowment, Volume 14, Issue 3
November 2020
217 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

research-article
