ABSTRACT
Continuous analytics systems that enable query processing over steams of data have emerged as key solutions for dealing with massive data volumes and demands for low latency. These systems have been heavily influenced by an assumption that data streams can be viewed as sequences of data that arrived more or less in order. The reality, however, is that streams are not often so well behaved and disruptions of various sorts are endemic. We argue, therefore, that stream processing needs a fundamental rethink and advocate a unified approach toward continuous analytics over discontinuous streaming data. Our approach is based on a simple insight - using techniques inspired by data parallel query processing, queries can be performed over independent sub-streams with arbitrary time ranges in parallel, generating partial results. The consolidation of the partial results over each sub-stream can then be deferred to the time at which the results are actually used on an on-demand basis. In this paper, we describe how the Truviso Continuous Analytics system implements this type of order-independent processing. Not only does the approach provide the first real solution to the problem of processing streaming data that arrives arbitrarily late, it also serves as a critical building block for solutions to a host of hard problems such as parallelism, recovery, transactional consistency, high availability, failover, and replication.
- Abadi, D., Carney, D., and Cetintemel, U., et al. Aurora: A Data Stream Management System. In Proc. CIDR 2003.Google Scholar
- Abadi, D., Ahmad, Y., Balazinska, B., et al. The Design of the Borealis Stream Processing Engine. In CIDR 2005.Google Scholar
- Arasu, A., Babcock, B., Babu, S., et al. STREAM: The Stanford Data Stream Management System. In Data Stream Management: Processing High-Speed Data Streams, Springer, 2009.Google Scholar
- Balazinska, M., Balakrishnan, H., Madden, S., Stonebraker, M. Fault-tolerance in the Borealis Distributed Stream Processing System. In SIGMOD 2005. Google ScholarDigital Library
- Bancilhon, F., et al. FAD, a powerful and simple database language. In VLDB 1987. Google ScholarDigital Library
- Chandrasekaran, S., Cooper, O., Deshpande, A., et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertan World. In Proc. CIDR 2003.Google Scholar
- Conway, N. Transactions and Data Stream Processing. http://neilconway.org/docs/thesis.pdf. April 2008.Google Scholar
- Dean, J., Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. Google ScholarDigital Library
- DeWitt, D., Gray, J. Parallel Database Systems: The Future of High Performance Database Systems. CACM 35(6) 1992. Google ScholarDigital Library
- Franklin, M., Krishnamurthy, S., et al. Continuous Analytics: Rethinking Query Processing in a Network-Effect World. In CIDR 2009.Google Scholar
- Gray, J., Reuter, A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann 1993, ISBN 1-55860-190-2. Google ScholarDigital Library
- Gray, J., et al. Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab and Sub-Total. In ICDE 1996. Google ScholarDigital Library
- Li, J., Tufte, K., Shkapenyuk, V., Papdimos, V., Johnson, T., Maier, D. Out-of-Order Processing: A New Architecture for High-Performance Stream Systems. In Proc. VLDB Endowment (2008), 274--288. Google ScholarDigital Library
- Mohan, C., Haderle, D., et al. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM TODS 17(1): 94--162 (1992). Google ScholarDigital Library
- Motwani, R., Widom, J., Arasu, A., et al. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In Proc. CIDR 2003.Google Scholar
- Shah, M., Hellerstein, J., Brewer, E. Fault-Tolerant Parallel Dataflows. In SIGMOD 2004. Google ScholarDigital Library
- Srivastava, U., Widom, J. Flexible Time Management in Data Stream Systems. In PODS 2004. Google ScholarDigital Library
- Tucker, P., Maier, D. Exploiting Punctuation Semantics in Data Streams. In ICDE 2002.Google Scholar
- Omniture Web Services API. https://sc.omniture.com/p/l10n/1.0/en_US/docs/WebServices_API_14_Implementation_Manual.pdfGoogle Scholar
Index Terms
- Continuous analytics over discontinuous streams
Recommendations
Characterizing memory requirements for queries over continuous data streams
This article deals with continuous conjunctive queries with arithmetic comparisons and optional aggregation over multiple data streams. An algorithm is presented for determining whether or not any given query can be evaluated using a bounded amount of ...
Exploiting Punctuation Semantics in Continuous Data Streams
As most current query processing architectures are already pipelined, it seems logical to apply them to data streams. However, two classes of query operators are impractical for processing long or infinite data streams. Unbounded stateful operators ...
Sketching distributed sliding-window data streams
While traditional data management systems focus on evaluating single, ad hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely ...
Comments