Abstract
Modern enterprises generate huge amounts of streaming data, for example, micro-blog feeds, financial data, network monitoring and industrial application monitoring. While Data Stream Management Systems have proven successful in providing support for real-time alerting, many applications, such as network monitoring for intrusion detection and real-time bidding, require complex analytics over historical and real-time data over the data streams. We present a new method to process one of the most fundamental analytical primitives, quantile queries, on the union of historical and streaming data. Our method combines an index on historical data with a memory-efficient sketch on streaming data to answer quantile queries with accuracy-resource tradeoffs that are significantly better than current solutions that are based solely on disk-resident indexes or solely on streaming algorithms.
- D. J. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. B. Zdonik. The design of the borealis stream processing engine. In CIDR, pages 277--289, 2005.Google Scholar
- A. Aggarwal and S. Vitter, Jeffrey. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116--1127, Sept. 1988. Google ScholarDigital Library
- R. Agrawal and A. N. Swami. A one-pass space-efficient algorithm for finding quantiles. In COMAD, 1995.Google Scholar
- M. Balazinska, Y. Kwon, N. Kuchta, and D. Lee. Moirae: History-enhanced monitoring. In CIDR, pages 375--386, 2007.Google Scholar
- S. Chandrasekaran. Query Processing over Live and Archived Data Streams. PhD thesis, Berkeley, CA, USA, 2005. Google ScholarDigital Library
- S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough? In SIGMOD, pages 436--447, 1998. Google ScholarDigital Library
- G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, and D. Srivastava. Holistic UDAFs at streaming speeds. In SIGMOD, pages 35--46, 2004. Google ScholarDigital Library
- N. Dindar, P. M. Fischer, M. Soner, and N. Tatbul. Efficiently correlating complex events over live and archived data streams. In DEBS, pages 243--254, 2011. Google ScholarDigital Library
- N. Dindar, P. M. Fischer, and N. Tatbul. Dejavu: A complex event processing system for pattern matching over live and historical data streams. In DEBS, pages 399--400, 2011. Google ScholarDigital Library
- U. Fiedler and B. Plattner. Using latency quantiles to engineer qos guarantees for web services. In IWQoS, pages 345--362, 2003. Google ScholarDigital Library
- P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. ACM Trans. Database Syst., 27(3):261--298, Sept. 2002. Google ScholarDigital Library
- L. Golab and T. Johnson. Consistency in a stream warehouse. In CIDR, pages 114--122, 2011.Google Scholar
- L. Golab, T. Johnson, J. S. Seidel, and V. Shkapenyuk. Stream warehousing with datadepot. In SIGMOD, pages 847--854, 2009. Google ScholarDigital Library
- G. Graefe. Implementing sorting in database systems. ACM Comput. Surv., 38(3), Sept. 2006. Google ScholarDigital Library
- M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD, pages 58--66, 2001. Google ScholarDigital Library
- T. Johnson and V. Shkapenyuk. Data stream warehousing in tidalrace. In CIDR, 2015.Google Scholar
- C. P. Kruskal. Searching, merging, and sorting in parallel computation. IEEE Trans. Comput., 32(10):942--946, Oct. 1983. Google ScholarDigital Library
- G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In SIGMOD, pages 426--435, 1998. Google ScholarDigital Library
- G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In SIGMOD, pages 251--262, 1999. Google ScholarDigital Library
- J. I. Munro and M. Paterson. Selection and sorting with limited storage. Theor. Comput. Sci., 12:315--323, 1980.Google ScholarCross Ref
- P. E. O'Neil, E. Cheng, D. Gawlick, and E. J. O'Neil. The log-structured merge-tree (LSM-Tree). Acta Inf., 33(4):351--385, 1996. Google ScholarDigital Library
- S. Peng, Z. Li, Q. Li, Q. Chen, W. Pan, H. Liu, and Y. Nie. Event detection over live and archived streams. In Proc. WAIM, pages 566--577, 2011. Google ScholarDigital Library
- F. Reiss, K. Stockinger, K. Wu, A. Shoshani, and J. M. Hellerstein. Enabling real-time querying of live and historical stream data. In SSDBM, pages 28--, 2007. Google ScholarDigital Library
- N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri. Medians and beyond: New aggregation techniques for sensor networks. In SenSys, pages 239--249, 2004. Google ScholarDigital Library
- K. Tufte, J. Li, D. Maier, V. Papadimos, R. L. Bertini, and J. Rucker. Travel time estimation using NiagaraST and Latte. In SIGMOD, pages 1091--1093, 2007. Google ScholarDigital Library
- L. Wang, G. Luo, K. Yi, and G. Cormode. Quantiles over data streams: An experimental study. In SIGMOD, pages 737--748, 2013. Google ScholarDigital Library
- M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In HotCloud, 2012. Google ScholarDigital Library
- Q. Zhou, Y. Simmhan, and V. Prasanna. Towards hybrid online on-demand querying of realtime data with stateful complex event processing. In IEEE BigData, 2013.Google ScholarCross Ref
Recommendations
PSoup: a system for streaming queries over streaming data
Abstract.Recent work on querying data streams has focused on systems where newly arriving data is processed and continuously streamed to the user in real time. In many emerging applications, however, ad hoc queries and/or intermittent connectivity also ...
Streaming queries over streaming data
VLDB '02: Proceedings of the 28th international conference on Very Large Data BasesRecent work on querying data streams has focused on systems where newly arriving data is processed and continuously streamed to the user in real-time. In many emerging applications, however, ad hoc queries and/or intermittent connectivity also require ...
Enabling Real-Time Querying of Live and Historical Stream Data
SSDBM '07: Proceedings of the 19th International Conference on Scientific and Statistical Database ManagementApplications that query data streams in order to identify trends, patterns, or anomalies can often benefit from comparing the live stream data with archived historical stream data. However, searching this historical data in real time has been considered ...
Comments