ABSTRACT
Applications like social networking, urban monitoring and market feed processing require stateful stream query: a query consults not only streaming data but also stored data to extract timely information; useful information from streaming data also needs to be continuously and consistently integrated into stored data to serve inflight and future queries. However, prior streaming systems either focus on stream computation, or are not stateful, or cannot provide low latency and high throughput to handle the fast-evolving linked data and increasing concurrency of queries.
This paper presents Wukong+S, a distributed stream querying engine that provides sub-millisecond stateful query at millions of queries per-second over fast-evolving linked data. Wukong+S uses an integrated design that combines the stream processor and the persistent store with efficient state sharing, which avoids the cross-system cost and sub-optimal query plan in conventional composite designs (e.g., Storm/Heron+Wukong). Wukong+S uses a hybrid store to differentially manage timeless data and timing data accordingly and provides an efficient stream index with locality-aware partitioning to facilitate fast access to streaming data. Wukong+S further provides decentralized vector timestamps with bounded snapshot scalarization to scale with nodes and massive queries at efficient memory usage.
We have designed Wukong+S conforming to the RDF data model and Continuous SPARQL (C-SPARQL) query interface and have implemented Wukong+S by extending a state-of-the-art static RDF store (namely Wukong). Evaluation on an 8-node RDMA-capable cluster using LSBench and CityBench shows that Wukong+S significantly outperforms existing system designs (e.g., CSPARQL-engine, Storm/Heron+Wukong, and Spark Streaming/Structured Streaming) for both latency and throughput, usually at the scale of orders of magnitude.
Supplemental Material
- Apache Jena. https://jena.apache.org/.Google Scholar
- Apache Kafka. http://kafka.apache.org/.Google Scholar
- DBpedia's SPARQL Benchmark. http://aksw.org/Projects/DBPSB.Google Scholar
- Esper. http//:www.espertech.com/esper/.Google Scholar
- Facebook Scribe. https://github.com/facebook/scribe.Google Scholar
- Microsoft StreamInsight Official blog. https://blogs.msdn.microsoft.com/streaminsight/.Google Scholar
- Options Price Reporting Authority. https://en.wikipedia.org/wiki/Options_Price_Reporting_Authority.Google Scholar
- RDF 1.1 Concepts and Abstract Syntax. https://www.w3.org/TR/rdf11-concepts/.Google Scholar
- D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, et al. The Design of the Borealis Stream Processing Engine. In CIDR, volume 5, pages 277--289, 2005.Google Scholar
- D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A New Model and Architecture for Data Stream Management. The VLDB Journal, 12(2):120--139, Aug. 2003. Google ScholarDigital Library
- T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant Stream Processing at Internet Scale. Proc. VLDB Endow., 6(11):1033--1044, Aug. 2013. Google ScholarDigital Library
- M. I. Ali, F. Gao, and A. Mileo. CityBench: A Configurable Benchmark to Evaluate RSP Engines Using Smart City Datasets. In Proc. ISWC, 2015.Google ScholarCross Ref
- R. Angles and C. Gutierrez. Survey of Graph Database Models. ACM Comput. Surv., 40(1):1:1--1:39, Feb. 2008. Google ScholarDigital Library
- M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1383--1394, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
- D. F. Barbieri, D. Braga, S. Ceri, E. D. VALLE, and M. Grossniklaus. C-SPARQL: A Continuous Query Language for RDF Data Streams. International Journal of Semantic Computing, 4(01):3--25, 2010.Google ScholarCross Ref
- Bio2RDF Consortium. Bio2RDF: Linked Data for the Life Science. http://bio2rdf.org/, 2014.Google Scholar
- M. H. Böhlen. Temporal Database System Implementations. SIGMOD Rec., 24(4):53--60, Dec. 1995. Google ScholarDigital Library
- I. Botan, Y. Cho, R. Derakhshan, N. Dindar, L. Haas, K. Kim, C. Lee, G. Mundada, M.-C. Shan, N. Tatbul, et al. Design and Implementation of the MaxStream Federated Stream Processing Architecture. ETH Zurich, Computer Science, Tech. Rep, 632, 2009.Google Scholar
- C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. FlumeJava: Easy, Efficient Data-parallel Pipelines. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '10, pages 363--375, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- G. J. Chen, J. L. Wiener, S. Iyer, A. Jaiswal, R. Lei, N. Simha, W. Wang, K. Wilfong, T. Williamson, and S. Yilmaz. Realtime Data Processing at Facebook. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 1087--1098, New York, NY, USA, 2016. ACM. Google ScholarDigital Library
- Y. Chen, X. Wei, J. Shi, R. Chen, and H. Chen. Fast and General Distributed Transactions Using RDMA and HTM. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pages 26:1--26:17, New York, NY, USA, 2016. ACM. Google ScholarDigital Library
- R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: Taking the Pulse of a Fast-changing and Connected World. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, pages 85-- 98, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
- Google Inc. Introducing the knowledge graph: things, not strings. https://googleblog.blogspot.co.uk/2012/05/introducing-knowledge-graph-things-not.html, 2012.Google Scholar
- S. Gurajada, S. Seufert, I. Miliaraki, and M. Theobald. TriAD: A Distributed Shared-nothing RDF Engine Based on Asynchronous Message Passing. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 289--300, New York, NY, USA, 2014. ACM. Google ScholarDigital Library
- J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. In Proceedings of the 15th International Conference on Data Engineering, ICDE '99, pages 106--115. IEEE, 1999. Google ScholarDigital Library
- J.-H. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik. High-availability Algorithms for Distributed Stream Processing. In Proceedings of the 21st International Conference on Data Engineering, ICDE '05, pages 779--790. IEEE, 2005. Google ScholarDigital Library
- S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. M. Patel, K. Ramasamy, and S. Taneja. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 239--250, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
- D. Le-Phuoc, M. Dao-Tran, M.-D. Pham, P. Boncz, T. Eiter, and M. Fink. Linked Stream Data Processing Engines: Facts and Figures. In Proc. ISWC, 2012. Google ScholarDigital Library
- W. Lin, H. Fan, Z. Qian, J. Xu, S. Yang, J. Zhou, and L. Zhou. StreamScope: Continuous Reliable Distributed Processing of Big Data Streams. In NSDI, pages 439--453, 2016. Google ScholarDigital Library
- C. Liu, R. Correa, H. Gill, T. Gill, X. Li, S. Muthukumar, T. Saeed, B. T. Loo, and P. Basu. PUMA: Policy-based Unified Multiradio Architecture for Agile Mesh Networking. IEEE/ACM Trans. Netw., 22(6):1897--1910, Dec. 2014. Google ScholarDigital Library
- S. Madden and M. J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In Proceedings of the 18th International Conference on Data Engineering, ICDE '02, pages 555-- 566. IEEE, 2002. Google ScholarDigital Library
- A. Mauri, J.-P. Calbimonte, D. Dellâăźaglio, M. Balduini, M. Brambilla, E. Della Valle, and K. Aberer. TripleWave: Spreading RDF Streams on the Web. In International Semantic Web Conference, pages 140--149. Springer, 2016.Google Scholar
- J. Meehan, N. Tatbul, S. Zdonik, C. Aslantas, U. Cetintemel, J. Du, T. Kraska, S. Madden, D. Maier, A. Pavlo, M. Stonebraker, K. Tufte, and H. Wang. S-Store: Streaming Meets Transaction Processing. Proc. VLDB Endow., 8(13):2134--2145, Sept. 2015. Google ScholarDigital Library
- R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In CIDR. Citeseer, 2003.Google Scholar
- D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A Timely Dataflow System. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 439--455, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
- National Center for Biotechnology Information. Pubchemrdf. https://pubchem.ncbi.nlm.nih.gov/rdf/, 2014.Google Scholar
- T. Neumann and G. Weikum. RDF-3X: A RISC-style Engine for RDF. Proc. VLDB Endow., 1(1):647--659, Aug. 2008. Google ScholarDigital Library
- L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In Proceedings of the IEEE International Conference on Data Mining Workshops, ICDMW '10, pages 170--177. IEEE, 2010. Google ScholarDigital Library
- S. A. Noghabi, K. Paramasivam, Y. Pan, N. Ramesh, J. Bringhurst, I. Gupta, and R. H. Campbell. Samza: Stateful Scalable Stream Processing at LinkedIn. Proc. VLDB Endow., 10(12):1634--1645, Aug. 2017. Google ScholarDigital Library
- Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu, and Z. Zhang. TimeStream: Reliable Stream Computation in the Cloud. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 1--14, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
- J. Shi, Y. Yao, R. Chen, H. Chen, and F. Li. Fast and Concurrent RDF Queries with RDMA-based Distributed Graph Exploration. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 317--332, Berkeley, CA, USA, 2016. USENIX Association. Google ScholarDigital Library
- X. Shi, B. Cui, Y. Shao, and Y. Tong. Tornado: A System For RealTime Iterative Analysis Over Evolving Data. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 417--430, New York, NY, USA, 2016. ACM. Google ScholarDigital Library
- C. Sirish, C. Owen, D. Amol, H. Wei, K. Sailesh, M. Samuel, R. Vijayshankar, and R. Frederick. TelegraphCQ: Continuous Dataflow Processing for An Uncertain World. In CIDR, pages 46--58, 2003.Google Scholar
- Streamreasoning Reserach Group. CSPARQL Engine. https://github.com/streamreasoning/CSPARQL-engine.Google Scholar
- A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 147--156, New York, NY, USA, 2014. ACM. Google ScholarDigital Library
- R. Van Bruggen. Demining the "Join Bomb" with Graph Queries. https://neo4j.com/blog/demining-the-join-bomb-with-graph-queries, 2013.Google Scholar
- K. Vora, R. Gupta, and G. Xu. KickStarter: Fast and Accurate Computations on Streaming Graphs via Trimmed Approximations. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '17, pages 237--251, New York, NY, USA, 2017. ACM. Google ScholarDigital Library
- X. Wei, S. Shen, R. Chen, and H. Chen. Replication-driven live reconfiguration for fast distributed transaction processing. In Proceedings of the USENIX Annual Technical Conference, USENIX ATC'17, pages 335--347, 2017. Google ScholarDigital Library
- W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: A Probabilistic Taxonomy for Text Understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 481--492, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
- M. Zaharia, T. Das, M. Armbrust, and R. Xin. Structured Streaming In Apache Spark -- A new high-level API for streaming. https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html.Google Scholar
- M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 423--438, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
- K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A distributed graph engine for web scale rdf data. In Proceedings of the 39th international conference on Very Large Data Bases, PVLDB'13, pages 265--276. VLDB Endowment, 2013. Google ScholarDigital Library
Recommendations
Enabling Real-Time Querying of Live and Historical Stream Data
SSDBM '07: Proceedings of the 19th International Conference on Scientific and Statistical Database ManagementApplications that query data streams in order to identify trends, patterns, or anomalies can often benefit from comparing the live stream data with archived historical stream data. However, searching this historical data in real time has been considered ...
Comments