Abstract
In recent years the processing of continuous queries over potentially infinite data streams has attracted a lot of research attention. We observed that the majority of work addresses individual stream operations and system-related issues rather than the development of a general-purpose basis for stream processing systems. Furthermore, example continuous queries are often formulated in some declarative query language without specifying the underlying semantics precisely enough. To overcome these deficiencies, this article presents a consistent and powerful operator algebra for data streams which ensures that continuous queries have well-defined, deterministic results. In analogy to traditional database systems, we distinguish between a logical and a physical operator algebra. While the logical algebra specifies the semantics of the individual operators in a descriptive but concrete way over temporal multisets, the physical algebra provides efficient implementations in the form of stream-to-stream operators. By adapting and enhancing research from temporal databases to meet the challenging requirements in streaming applications, we are able to carry over the conventional transformation rules from relational databases to stream processing. For this reason, our approach not only makes it possible to express continuous queries with a sound semantics, but also provides a solid foundation for query optimization, one of the major research topics in the stream community. Since this article seamlessly explains the steps from query formulation to query execution, it outlines the innovative features and operational functionality implemented in our state-of-the-art stream processing infrastructure.
Supplemental Material
Available for Download
Online appendix to semantics and implementation of continuous sliding window queries over data streams. The appendix supports the information on article 4.
- Abadi, D. J., Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., and Zdonik, S. 2003. Aurora: A new model and architecture for data stream management. VLDB J. 12, 2, 120--139. Google ScholarDigital Library
- Arasu, A., Babcock, B., Babu, S., McAlister, J., and Widom, J. 2002. Characterizing memory requirements for queries over continuous data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 221--232. Google ScholarDigital Library
- Arasu, A., Babu, S., and Widom, J. 2006. The CQL continuous query language: Semantic foundations and query execution. VLDB J. 15, 2, 121--142. Google ScholarDigital Library
- Babcock, B., Babu, S., Datar, M., and Motwani, R. 2003. Chain: Operator scheduling for memory minimization in data stream systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 253--264. Google ScholarDigital Library
- Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. 2002. Models and issues in data stream systems. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1--16. Google ScholarDigital Library
- Babu, S., Munagala, K., Widom, J., and Motwani, R. 2005. Adaptive caching for continuous queries. In Proceedings of the International Conference on Data Engineering (ICDE), 118--129. Google ScholarDigital Library
- Bai, Y., Thakkar, H., Wang, H., Luo, C., and Zaniolo, C. 2006. A data stream language and system designed for power and extensibility. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), 337--346. Google ScholarDigital Library
- Barga, R. S., Goldstein, J., Ali, M. H., and Hong, M. 2007. Consistent streaming through time: A vision for event stream processing. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), 363--374.Google Scholar
- Cammert, M., Krämer, J., Seeger, B., and Vaupel, S. 2008. A cost-based approach to adaptive resource management in data stream systems. IEEE Trans. Knowl. Data Eng. 20, 2, 230--245. Google ScholarDigital Library
- Carney, D., Cetintemel, U., Zdonik, S., Rasin, A., Cerniak, M., and Stonebraker, M. 2003. Operator scheduling in a data stream manager. In Proceedings of the International Conference on Very Large Databases (VLDB), 838--849. Google ScholarDigital Library
- Chandrasekaran, S., Cooper, O., Deshpande, A., and et al. 2003. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proceedings of the Conference on Innovative Data Systems Research (CIDR).Google Scholar
- Chandrasekaran, S. and Franklin, M. J. 2002. Streaming queries over streaming data. In Proceedings of the International Conference on Very Large Databases (VLDB), 203--214. Google ScholarDigital Library
- Chaudhuri, S., Motwani, R., and Narasayya, V. 1999. On random sampling over joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 263--274. Google ScholarDigital Library
- Chen, J., DeWitt, D., Tian, F., and Wang, Y. 2000. NiagaraCQ: A scalable continuous query system for Internet databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 379--390. Google ScholarDigital Library
- Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. 2001. Introduction to Algorithms, 2nd ed. The MIT Press. Google ScholarDigital Library
- Cranor, C. D., Johnson, T., Spatscheck, O., and Shkapenyuk, V. 2003. Gigascope: A stream database for network applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 647--651. Google ScholarDigital Library
- Datar, M., Gionis, A., Indyk, P., and Motwani, R. 2002. Maintaining stream statistics over sliding windows. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 635--644. Google ScholarDigital Library
- Dayal, U. 1987. Of nests and trees: A unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers. In Proceedings of the International Conference on Very Large Databases, 197--208. Google ScholarDigital Library
- Dayal, U., Goodman, N., and Katz, R. H. 1982. An extended relational algebra with control over duplicate elimination. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 117--123. Google ScholarDigital Library
- Demers, A. J., Gehrke, J., Panda, B., Riedewald, M., Sharma, V., and White, W. M. 2007. Cayuga: A General purpose event monitoring system. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), 412--422.Google Scholar
- Dittrich, J.-P., Seeger, B., Taylor, D. S., and Widmayer, P. 2002. Progressive merge join: A generic and non-blocking sort-based join algorithm. In Proceedings of the International Conference on Very Large Databases (VLDB), 299--310. Google ScholarDigital Library
- Gao, D., Jensen, C. S., Snodgrass, R. T., and Soo, M. D. 2005. Join operations in temporal databases. VLDB J. 14, 1, 2--29. Google ScholarDigital Library
- Garcia-Molina, H., Ullman, J. D., and Widom, J. 2000. Database System Implementation. Prentice Hall. Google ScholarDigital Library
- Ghanem, T. M., Hammad, M. A., Mokbel, M. F., Aref, W. G., and Elmagarmid, A. K. 2007. Incremental evaluation of sliding-window queries over data streams. IEEE Trans. Knowl. Data Eng. 19, 1, 57--72. Google ScholarDigital Library
- Golab, L. and Ozsu, M. 2003a. Processing sliding window multi-joins in continuous queries over data streams. In Proceedings of the International Conference on Very Large Databases (VLDB), 500--511. Google ScholarDigital Library
- Golab, L. and Ozsu, M. 2003b. Issues in data stream management. SIGMOD Rec. 32, 2, 5--14. Google ScholarDigital Library
- Golab, L., Ozsu, M. 2005. Update-Pattern-Aware modeling and processing of continuous queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 658--669. Google ScholarDigital Library
- Graefe, G. 1993. Query evaluation techniques for large databases. ACM Comput. Surv. 25, 2, 73--170. Google ScholarDigital Library
- Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., and Pirahesh, H. 1997. Data cube: A relational aggregation operator generalizing group-by, cross- tab, and sub-totals. Data Mining Knowl. Discov. 1, 1, 29--53. Google ScholarDigital Library
- Guha, S., Meyerson, A., Mishra, N., Motwani, R., and O'Callaghan, L. 2003. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng. 15, 3, 515--528. Google ScholarDigital Library
- Haas, P. J. and Hellerstein, J. M. 1999. Ripple joins for online aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 287--298. Google ScholarDigital Library
- Hellerstein, J. M., Haas, P. J., and Wang, H. 1997. Online aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 171--182. Google ScholarDigital Library
- Jensen, C. S., Clifford, J., Elmasri, R., Gadia, S. K., Hayes, P. J., and Jajodia, S. 1994. A consensus glossary of temporal database concepts. SIGMOD Rec. 23, 1, 52--64. Google ScholarDigital Library
- Kang, J., Naughton, J., and Viglas, S. 2003. Evaluating window joins over unbounded streams. In Proceedings of the International Conference on Data Engineering (ICDE), 341--352.Google Scholar
- Karp, R. M., Shenker, S., and Papadimitriou, C. H. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28, 51--55. Google ScholarDigital Library
- Krämer, J. 2007. Continuous queries over data streams -- Semantics and implementation. Ph.D. thesis, University of Marburg.Google Scholar
- Krämer, J. and Seeger, B. 2004. PIPES - A public infrastructure for processing and exploring streams. In Proceedings of the ACM SIGMOD International Conference on Database Management, 925--926. Google ScholarDigital Library
- Krämer, J. and Seeger, B. 2005. A temporal foundation for continuous queries over data streams. In Proceedings of the International Conference on Management of Data (COMAD), 70--82.Google Scholar
- Law, Y.-N., Wang, H., and Zaniolo, C. 2004. Query languages and data models for database sequences and data streams. In Proceedings of the International Conference on Very Large Databases (VLDB), 492--503. Google ScholarDigital Library
- Li, J., Maier, D., Tufte, K., Papadimos, V., and Tucker, P. A. 2005. Semantics and evaluation techniques for window aggregates in data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 311--322. Google ScholarDigital Library
- Madden, S., Franklin, M. J., Hellerstein, J. M., and Hong, W. 2002. TAG: A tiny aggregation service for ad-hoc sensor networks. In Symposium on Operating System Design and Implementation (OSDI). Google ScholarDigital Library
- Manku, G. S. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the International Conference on Very Large Databases (VLDB). Google ScholarDigital Library
- Nievergelt, J., and Preparata, F. P. 1982. Plane-Sweep algorithms for intersecting geometric figures. Commun. ACM 25, 10, 739--747. Google ScholarDigital Library
- Patroumpas, K. and Sellis, T. K. 2006. Window specification over data streams. In Proceedings of the EDBT Workshops, 445--464. Google ScholarDigital Library
- Raman, V., Deshpande, A., and Hellerstein, J. M. 2003. Using state modules for adaptive query processing. In Proceedings of the International Conference on Data Engineering (ICDE), 353.Google Scholar
- Roy, P., Seshadri, S., Sudarshan, S., and Bhobe, S. 2000. Efficient and extensible algorithms for multi query optimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 249--260. Google ScholarDigital Library
- Sellis, T. K. 1988. Multiple-Query optimization. ACM Trans. Database Syst. 13, 1, 23--52. Google ScholarDigital Library
- Slivinskas, G., Jensen, C. S., and Snodgrass, R. T. 2001. A foundation for conventional and temporal query optimization addressing duplicates and ordering. IEEE Trans. Knowl. Data Eng. 13, 1, 21--49. Google ScholarDigital Library
- SQR. 2003. SQR -- A stream query repository. http://www.db.stanford.edu/stream/sqr.Google Scholar
- Srivastava, U. and Widom, J. 2004. Flexible time management in data stream systems. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Google ScholarDigital Library
- Tansel, A., Clifford, J., Gadia, S., Jajodia, S., Segev, A., and Snodgrass, R. T. 1993. Temporal Databases: Theory, Design, and Implementation. Benjamin/Cummings. Google ScholarDigital Library
- Tatbul, N., Cetintemel, U., Zdonik, S. B., Cherniack, M., and Stonebraker, M. 2003. Load shedding in a data stream manager. In Proceedings of the International Conference on Very Large Databases (VLDB), 309--320. Google ScholarDigital Library
- Tucker, P. A., Maier, D., Sheard, T., and Fegaras, L. 2003. Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowl. Data Eng. 15, 3, 555--568. Google ScholarDigital Library
- Tucker, P. A., Tufte, K., Papadimos, V., and Maier, D. 2002. NEXMark -- A benchmark for queries over data streams. http://www.cse.ogi.edu/dot/niagara/NEXMark.Google Scholar
- Viglas, S. D. and Naughton, J. F. 2002. Rate-Based query optimization for streaming information sources. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 37--48. Google ScholarDigital Library
- Viglas, S. D., Naughton, J. F., and Burger, J. 2003. Maximizing the output rate of multi-join queries over streaming information sources. In Proceedings of the International Conference on Very Large Databases (VLDB), 285--296. Google ScholarDigital Library
- Wang, H., Zaniolo, C., and Luo, C. 2003. ATLaS: A small but complete SQL extension for data mining and data streams. In Proceedings of the International Conference on Very Large Databases (VLDB), 1113--1116. Google ScholarDigital Library
- Yang, J. and Widom, J. 2001. Incremental computation and maintenance of temporal aggregates. In Proceedings of the International Conference on Data Engineering, 51--60. Google ScholarDigital Library
- Yang, Y., Krämer, J., Papadias, D., and Seeger, B. 2007. HybMig: A hybrid approach to dynamic plan migration for continuous queries. IEEE Trans. Knowl. Data Eng. 19, 3, 398--411. Google ScholarDigital Library
- Zhu, Y., Rundensteiner, E. A., and Heineman, G. T. 2004. Dynamic plan migration for continuous queries over data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 431--442. Google ScholarDigital Library
Index Terms
- Semantics and implementation of continuous sliding window queries over data streams
Recommendations
Sketching distributed sliding-window data streams
While traditional data management systems focus on evaluating single, ad hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely ...
Exploiting Punctuation Semantics in Continuous Data Streams
As most current query processing architectures are already pipelined, it seems logical to apply them to data streams. However, two classes of query operators are impractical for processing long or infinite data streams. Unbounded stateful operators ...
Continuous query processing in data streams using duality of data and queries
SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of dataRecent data stream systems such as TelegraphCQ have employed the well-known property of duality between data and queries. In these systems, query processing methods are classified into two dual categories -- data-initiative and query-initiative -- ...
Comments