skip to main content
research-article

Estimating quantiles from the union of historical and streaming data

Published:01 November 2016Publication History
Skip Abstract Section

Abstract

Modern enterprises generate huge amounts of streaming data, for example, micro-blog feeds, financial data, network monitoring and industrial application monitoring. While Data Stream Management Systems have proven successful in providing support for real-time alerting, many applications, such as network monitoring for intrusion detection and real-time bidding, require complex analytics over historical and real-time data over the data streams. We present a new method to process one of the most fundamental analytical primitives, quantile queries, on the union of historical and streaming data. Our method combines an index on historical data with a memory-efficient sketch on streaming data to answer quantile queries with accuracy-resource tradeoffs that are significantly better than current solutions that are based solely on disk-resident indexes or solely on streaming algorithms.

References

  1. D. J. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. B. Zdonik. The design of the borealis stream processing engine. In CIDR, pages 277--289, 2005.Google ScholarGoogle Scholar
  2. A. Aggarwal and S. Vitter, Jeffrey. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116--1127, Sept. 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Agrawal and A. N. Swami. A one-pass space-efficient algorithm for finding quantiles. In COMAD, 1995.Google ScholarGoogle Scholar
  4. M. Balazinska, Y. Kwon, N. Kuchta, and D. Lee. Moirae: History-enhanced monitoring. In CIDR, pages 375--386, 2007.Google ScholarGoogle Scholar
  5. S. Chandrasekaran. Query Processing over Live and Archived Data Streams. PhD thesis, Berkeley, CA, USA, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough? In SIGMOD, pages 436--447, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, and D. Srivastava. Holistic UDAFs at streaming speeds. In SIGMOD, pages 35--46, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. N. Dindar, P. M. Fischer, M. Soner, and N. Tatbul. Efficiently correlating complex events over live and archived data streams. In DEBS, pages 243--254, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. N. Dindar, P. M. Fischer, and N. Tatbul. Dejavu: A complex event processing system for pattern matching over live and historical data streams. In DEBS, pages 399--400, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. U. Fiedler and B. Plattner. Using latency quantiles to engineer qos guarantees for web services. In IWQoS, pages 345--362, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. ACM Trans. Database Syst., 27(3):261--298, Sept. 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. L. Golab and T. Johnson. Consistency in a stream warehouse. In CIDR, pages 114--122, 2011.Google ScholarGoogle Scholar
  13. L. Golab, T. Johnson, J. S. Seidel, and V. Shkapenyuk. Stream warehousing with datadepot. In SIGMOD, pages 847--854, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. G. Graefe. Implementing sorting in database systems. ACM Comput. Surv., 38(3), Sept. 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD, pages 58--66, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. T. Johnson and V. Shkapenyuk. Data stream warehousing in tidalrace. In CIDR, 2015.Google ScholarGoogle Scholar
  17. C. P. Kruskal. Searching, merging, and sorting in parallel computation. IEEE Trans. Comput., 32(10):942--946, Oct. 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In SIGMOD, pages 426--435, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In SIGMOD, pages 251--262, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. I. Munro and M. Paterson. Selection and sorting with limited storage. Theor. Comput. Sci., 12:315--323, 1980.Google ScholarGoogle ScholarCross RefCross Ref
  21. P. E. O'Neil, E. Cheng, D. Gawlick, and E. J. O'Neil. The log-structured merge-tree (LSM-Tree). Acta Inf., 33(4):351--385, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. S. Peng, Z. Li, Q. Li, Q. Chen, W. Pan, H. Liu, and Y. Nie. Event detection over live and archived streams. In Proc. WAIM, pages 566--577, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. F. Reiss, K. Stockinger, K. Wu, A. Shoshani, and J. M. Hellerstein. Enabling real-time querying of live and historical stream data. In SSDBM, pages 28--, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri. Medians and beyond: New aggregation techniques for sensor networks. In SenSys, pages 239--249, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. K. Tufte, J. Li, D. Maier, V. Papadimos, R. L. Bertini, and J. Rucker. Travel time estimation using NiagaraST and Latte. In SIGMOD, pages 1091--1093, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. L. Wang, G. Luo, K. Yi, and G. Cormode. Quantiles over data streams: An experimental study. In SIGMOD, pages 737--748, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In HotCloud, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Q. Zhou, Y. Simmhan, and V. Prasanna. Towards hybrid online on-demand querying of realtime data with stateful complex event processing. In IEEE BigData, 2013.Google ScholarGoogle ScholarCross RefCross Ref

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 10, Issue 4
    November 2016
    180 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 November 2016
    Published in pvldb Volume 10, Issue 4

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader