skip to main content
10.1145/3132747.3132777acmconferencesArticle/Chapter ViewAbstractPublication PagessospConference Proceedingsconference-collections
research-article

Sub-millisecond Stateful Stream Querying over Fast-evolving Linked Data

Authors Info & Claims
Published:14 October 2017Publication History

ABSTRACT

Applications like social networking, urban monitoring and market feed processing require stateful stream query: a query consults not only streaming data but also stored data to extract timely information; useful information from streaming data also needs to be continuously and consistently integrated into stored data to serve inflight and future queries. However, prior streaming systems either focus on stream computation, or are not stateful, or cannot provide low latency and high throughput to handle the fast-evolving linked data and increasing concurrency of queries.

This paper presents Wukong+S, a distributed stream querying engine that provides sub-millisecond stateful query at millions of queries per-second over fast-evolving linked data. Wukong+S uses an integrated design that combines the stream processor and the persistent store with efficient state sharing, which avoids the cross-system cost and sub-optimal query plan in conventional composite designs (e.g., Storm/Heron+Wukong). Wukong+S uses a hybrid store to differentially manage timeless data and timing data accordingly and provides an efficient stream index with locality-aware partitioning to facilitate fast access to streaming data. Wukong+S further provides decentralized vector timestamps with bounded snapshot scalarization to scale with nodes and massive queries at efficient memory usage.

We have designed Wukong+S conforming to the RDF data model and Continuous SPARQL (C-SPARQL) query interface and have implemented Wukong+S by extending a state-of-the-art static RDF store (namely Wukong). Evaluation on an 8-node RDMA-capable cluster using LSBench and CityBench shows that Wukong+S significantly outperforms existing system designs (e.g., CSPARQL-engine, Storm/Heron+Wukong, and Spark Streaming/Structured Streaming) for both latency and throughput, usually at the scale of orders of magnitude.

Skip Supplemental Material Section

Supplemental Material

stream_query.mp4

mp4

1.8 GB

References

  1. Apache Jena. https://jena.apache.org/.Google ScholarGoogle Scholar
  2. Apache Kafka. http://kafka.apache.org/.Google ScholarGoogle Scholar
  3. DBpedia's SPARQL Benchmark. http://aksw.org/Projects/DBPSB.Google ScholarGoogle Scholar
  4. Esper. http//:www.espertech.com/esper/.Google ScholarGoogle Scholar
  5. Facebook Scribe. https://github.com/facebook/scribe.Google ScholarGoogle Scholar
  6. Microsoft StreamInsight Official blog. https://blogs.msdn.microsoft.com/streaminsight/.Google ScholarGoogle Scholar
  7. Options Price Reporting Authority. https://en.wikipedia.org/wiki/Options_Price_Reporting_Authority.Google ScholarGoogle Scholar
  8. RDF 1.1 Concepts and Abstract Syntax. https://www.w3.org/TR/rdf11-concepts/.Google ScholarGoogle Scholar
  9. D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, et al. The Design of the Borealis Stream Processing Engine. In CIDR, volume 5, pages 277--289, 2005.Google ScholarGoogle Scholar
  10. D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A New Model and Architecture for Data Stream Management. The VLDB Journal, 12(2):120--139, Aug. 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant Stream Processing at Internet Scale. Proc. VLDB Endow., 6(11):1033--1044, Aug. 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. I. Ali, F. Gao, and A. Mileo. CityBench: A Configurable Benchmark to Evaluate RSP Engines Using Smart City Datasets. In Proc. ISWC, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  13. R. Angles and C. Gutierrez. Survey of Graph Database Models. ACM Comput. Surv., 40(1):1:1--1:39, Feb. 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1383--1394, New York, NY, USA, 2015. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. D. F. Barbieri, D. Braga, S. Ceri, E. D. VALLE, and M. Grossniklaus. C-SPARQL: A Continuous Query Language for RDF Data Streams. International Journal of Semantic Computing, 4(01):3--25, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  16. Bio2RDF Consortium. Bio2RDF: Linked Data for the Life Science. http://bio2rdf.org/, 2014.Google ScholarGoogle Scholar
  17. M. H. Böhlen. Temporal Database System Implementations. SIGMOD Rec., 24(4):53--60, Dec. 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. I. Botan, Y. Cho, R. Derakhshan, N. Dindar, L. Haas, K. Kim, C. Lee, G. Mundada, M.-C. Shan, N. Tatbul, et al. Design and Implementation of the MaxStream Federated Stream Processing Architecture. ETH Zurich, Computer Science, Tech. Rep, 632, 2009.Google ScholarGoogle Scholar
  19. C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. FlumeJava: Easy, Efficient Data-parallel Pipelines. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '10, pages 363--375, New York, NY, USA, 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. G. J. Chen, J. L. Wiener, S. Iyer, A. Jaiswal, R. Lei, N. Simha, W. Wang, K. Wilfong, T. Williamson, and S. Yilmaz. Realtime Data Processing at Facebook. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 1087--1098, New York, NY, USA, 2016. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Y. Chen, X. Wei, J. Shi, R. Chen, and H. Chen. Fast and General Distributed Transactions Using RDMA and HTM. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pages 26:1--26:17, New York, NY, USA, 2016. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: Taking the Pulse of a Fast-changing and Connected World. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, pages 85-- 98, New York, NY, USA, 2012. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Google Inc. Introducing the knowledge graph: things, not strings. https://googleblog.blogspot.co.uk/2012/05/introducing-knowledge-graph-things-not.html, 2012.Google ScholarGoogle Scholar
  24. S. Gurajada, S. Seufert, I. Miliaraki, and M. Theobald. TriAD: A Distributed Shared-nothing RDF Engine Based on Asynchronous Message Passing. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 289--300, New York, NY, USA, 2014. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. In Proceedings of the 15th International Conference on Data Engineering, ICDE '99, pages 106--115. IEEE, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. J.-H. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik. High-availability Algorithms for Distributed Stream Processing. In Proceedings of the 21st International Conference on Data Engineering, ICDE '05, pages 779--790. IEEE, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. M. Patel, K. Ramasamy, and S. Taneja. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 239--250, New York, NY, USA, 2015. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. D. Le-Phuoc, M. Dao-Tran, M.-D. Pham, P. Boncz, T. Eiter, and M. Fink. Linked Stream Data Processing Engines: Facts and Figures. In Proc. ISWC, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. W. Lin, H. Fan, Z. Qian, J. Xu, S. Yang, J. Zhou, and L. Zhou. StreamScope: Continuous Reliable Distributed Processing of Big Data Streams. In NSDI, pages 439--453, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. C. Liu, R. Correa, H. Gill, T. Gill, X. Li, S. Muthukumar, T. Saeed, B. T. Loo, and P. Basu. PUMA: Policy-based Unified Multiradio Architecture for Agile Mesh Networking. IEEE/ACM Trans. Netw., 22(6):1897--1910, Dec. 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. S. Madden and M. J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In Proceedings of the 18th International Conference on Data Engineering, ICDE '02, pages 555-- 566. IEEE, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. A. Mauri, J.-P. Calbimonte, D. Dellâăźaglio, M. Balduini, M. Brambilla, E. Della Valle, and K. Aberer. TripleWave: Spreading RDF Streams on the Web. In International Semantic Web Conference, pages 140--149. Springer, 2016.Google ScholarGoogle Scholar
  33. J. Meehan, N. Tatbul, S. Zdonik, C. Aslantas, U. Cetintemel, J. Du, T. Kraska, S. Madden, D. Maier, A. Pavlo, M. Stonebraker, K. Tufte, and H. Wang. S-Store: Streaming Meets Transaction Processing. Proc. VLDB Endow., 8(13):2134--2145, Sept. 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In CIDR. Citeseer, 2003.Google ScholarGoogle Scholar
  35. D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A Timely Dataflow System. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 439--455, New York, NY, USA, 2013. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. National Center for Biotechnology Information. Pubchemrdf. https://pubchem.ncbi.nlm.nih.gov/rdf/, 2014.Google ScholarGoogle Scholar
  37. T. Neumann and G. Weikum. RDF-3X: A RISC-style Engine for RDF. Proc. VLDB Endow., 1(1):647--659, Aug. 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In Proceedings of the IEEE International Conference on Data Mining Workshops, ICDMW '10, pages 170--177. IEEE, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. S. A. Noghabi, K. Paramasivam, Y. Pan, N. Ramesh, J. Bringhurst, I. Gupta, and R. H. Campbell. Samza: Stateful Scalable Stream Processing at LinkedIn. Proc. VLDB Endow., 10(12):1634--1645, Aug. 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu, and Z. Zhang. TimeStream: Reliable Stream Computation in the Cloud. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 1--14, New York, NY, USA, 2013. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. J. Shi, Y. Yao, R. Chen, H. Chen, and F. Li. Fast and Concurrent RDF Queries with RDMA-based Distributed Graph Exploration. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 317--332, Berkeley, CA, USA, 2016. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. X. Shi, B. Cui, Y. Shao, and Y. Tong. Tornado: A System For RealTime Iterative Analysis Over Evolving Data. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 417--430, New York, NY, USA, 2016. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. C. Sirish, C. Owen, D. Amol, H. Wei, K. Sailesh, M. Samuel, R. Vijayshankar, and R. Frederick. TelegraphCQ: Continuous Dataflow Processing for An Uncertain World. In CIDR, pages 46--58, 2003.Google ScholarGoogle Scholar
  44. Streamreasoning Reserach Group. CSPARQL Engine. https://github.com/streamreasoning/CSPARQL-engine.Google ScholarGoogle Scholar
  45. A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 147--156, New York, NY, USA, 2014. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. R. Van Bruggen. Demining the "Join Bomb" with Graph Queries. https://neo4j.com/blog/demining-the-join-bomb-with-graph-queries, 2013.Google ScholarGoogle Scholar
  47. K. Vora, R. Gupta, and G. Xu. KickStarter: Fast and Accurate Computations on Streaming Graphs via Trimmed Approximations. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '17, pages 237--251, New York, NY, USA, 2017. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. X. Wei, S. Shen, R. Chen, and H. Chen. Replication-driven live reconfiguration for fast distributed transaction processing. In Proceedings of the USENIX Annual Technical Conference, USENIX ATC'17, pages 335--347, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: A Probabilistic Taxonomy for Text Understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 481--492, New York, NY, USA, 2012. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. M. Zaharia, T. Das, M. Armbrust, and R. Xin. Structured Streaming In Apache Spark -- A new high-level API for streaming. https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html.Google ScholarGoogle Scholar
  51. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 423--438, New York, NY, USA, 2013. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A distributed graph engine for web scale rdf data. In Proceedings of the 39th international conference on Very Large Data Bases, PVLDB'13, pages 265--276. VLDB Endowment, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image ACM Conferences
    SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles
    October 2017
    677 pages
    ISBN:9781450350853
    DOI:10.1145/3132747

    Copyright © 2017 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 14 October 2017

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate131of716submissions,18%

    Upcoming Conference

    SOSP '24

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader