Abstract
We consider external algorithms for skyline computation without pre-processing. Our goal is to develop an algorithm with a good worst case guarantee while performing well on average. Due to the nature of disks, it is desirable that such algorithms access the input as a stream (even if in multiple passes). Using the tools of randomness, proved to be useful in many applications, we present an efficient multi-pass streaming algorithm, RAND, for skyline computation. As far as we are aware, RAND is the first randomized skyline algorithm in the literature.
RAND is near-optimal for the streaming model, which we prove via a simple lower bound. Additionally, our algorithm is distributable and can handle partially ordered domains on each attribute. Finally, we demonstrate the robustness of RAND via extensive experiments on both real and synthetic datasets. RAND is comparable to the existing algorithms in average case and additionally tolerant to simple modifications of the data, while other algorithms degrade considerably with such variation.
- L. Arge and J. S. Vitter. Optimal dynamic interval management in external memory. In FOCS, pages 560--569, 1996. Google ScholarDigital Library
- W.-T. Balke, U. Güntzer, and J. X. Zheng. Efficient distributed skylining for web information systems. In EDBT, pages 256--273, 2004.Google ScholarCross Ref
- O. Barndorff-Nielsen and M. Sobel. On the distribution of the number of admissible points in a vector random sample. Theory of Probability and its Applications, 11(2):249--269, 1966.Google ScholarCross Ref
- S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, pages 421--430, 2001. Google ScholarDigital Library
- C.-Y. Chan, P.-K. Eng, and K.-L. Tan. Efficient processing of skyline queries with partially-ordered domains. In ICDE'05, pages 190--191, Washington, DC, USA, 2005. IEEE Computer Society. Google ScholarDigital Library
- C. Y. Chan, P.-K. Eng, and K.-L. Tan. Stratified computation of skylines with partially-ordered domains. In SIGMOD Conference, pages 203--214, 2005. Google ScholarDigital Library
- S. Chaudhuri, N. Dalvi, and R. Kaushik. Robust cardinality and cost estimation for skyline operator. In ICDE'06, page 64, Washington, DC, USA, 2006. IEEE Computer Society. Google ScholarDigital Library
- J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting: Theory and optimizations. In Intelligent Information Systems, pages 595--604, 2005. Also appeared in ICDE'03.Google Scholar
- M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS, pages 285--298, 1999. Google ScholarDigital Library
- P. Godfrey, R. Shipley, and J. Gryz. Algorithms and analyses for maximal vector computation. VLDB J., 16(1):5--28, 2007. Also appeared in VLDB'05. Google ScholarDigital Library
- M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD '01, pages 58--66, New York, NY, USA, 2001. ACM. Google ScholarDigital Library
- B. Jiang and J. Pei. Online interval skyline queries on time series. In ICDE, 2009. Google ScholarDigital Library
- D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: an online algorithm for skyline queries. In VLDB'02, pages 275--286. VLDB Endowment, 2002. Google ScholarDigital Library
- H. T. Kung, F. Luccio, and F. P. Preparata. On finding the maxima of a set of vectors. J. ACM, 22(4):469--476, 1975. Google ScholarDigital Library
- X. Lin, Y. Yuan, W. Wang, and H. Lu. Stabbing the sky: Efficient skyline computation over sliding windows. In ICDE, pages 502--513, 2005. Google ScholarDigital Library
- J. Matousek. Computing dominances in en. Inf. Process. Lett., 38(5):277--278, 1991. Google ScholarDigital Library
- M. D. Morse, J. M. Patel, and H. V. Jagadish. Efficient skyline computation over low-cardinality domains. In VLDB, pages 267--278, 2007. Google ScholarDigital Library
- D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive skyline computation in database systems. ACM Trans. Database Syst., 30(1):41--82, 2005. Also appeard in SIGMOD'03. Google ScholarDigital Library
- S. Park, T. Kim, J. Park, J. Kim, and H. Im. Parallel skyline computation on multicore architectures. In ICDE, 2009. Google ScholarDigital Library
- J. Pei, B. Jiang, X. Lin, and Y. Yuan. Probabilistic skylines on uncertain data. In VLDB, pages 15--26, 2007. Google ScholarDigital Library
- J. M. Ruhl. Efficient algorithms for new computational models. PhD thesis, 2003. Supervisor-David R. Karger. Google ScholarDigital Library
- D. Sacharidis, S. Papadopoulos, and D. Papadias. Topologically sorted skylines for partially ordered domains. In ICDE, 2009. Google ScholarDigital Library
- N. Schweikardt. Machine models and lower bounds for query processing. In PODS, pages 41--52, 2007. Google ScholarDigital Library
- N. Schweikardt. Lower bounds for multi-pass processing of multiple data streams. In STACS, pages 51--61, 2009.Google Scholar
- K.-L. Tan, P.-K. Eng, and B. C. Ooi. Efficient progressive skyline computation. In VLDB '01, pages 301--310, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- Y. Tao, L. Ding, X. Lin, and J. Pei. Distance-based representative skyline. In ICDE, 2009. Google ScholarDigital Library
- Y. Tao and D. Papadias. Maintaining sliding window skylines on data streams. IEEE Trans. Knowl. Data Eng., 18(2):377--391, 2006. Google ScholarDigital Library
- R. Torlone, P. Ciaccia, and U. Romatre. Which are my preferred items. In In Workshop on Recommendation and Personalization in E-Commerce, pages 1--9, 2002.Google Scholar
- J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37--57, 1985. Also appeared in FOCS'83. Google ScholarDigital Library
- J. S. Vitter. Algorithms and data structures for external memory. Foundations and Trends in Theoretical Computer Science, 2(4):305--474, 2006. Google ScholarDigital Library
- A. Vlachou, C. Doulkeridis, Y. Kotidis, and M. Vazirgiannis. Skypeer: Efficient subspace skyline computation over distributed data. In ICDE, pages 416--425, 2007.Google ScholarCross Ref
- R. C.-W. Wong, A. W.-C. Fu, J. Pei, Y. S. Ho, T. Wong, and Y. Liu. Efficient skyline querying with variable user preferences on nominal attributes. PVLDB, 1(1):1032--1043, 2008. Google ScholarDigital Library
- R. C.-W. Wong, J. Pei, A. W.-C. Fu, and K. Wang. Online skyline analysis with dynamic preferences on nominal attributes. IEEE Trans. Knowl. Data Eng., 21(1):35--49, 2009. Google ScholarDigital Library
- A. C.-C. Yao. Probabilistic computations: Toward a unified measure of complexity (extended abstract). In FOCS, pages 222--227, 1977. Google ScholarDigital Library
- W. Zhang, X. Lin, Y. Zhang, W. Wang, and J. X. Yu. Probabilistic skyline operator over sliding windows. In ICDE, 2009. Google ScholarDigital Library
- L. Zhu, Y. Tao, and S. Zhou. Distributed skyline retrieval with low bandwidth consumption. IEEE Transactions on Knowledge and Data Engineering, 21(3):384--400, 2009. Google ScholarDigital Library
Index Terms
- Randomized multi-pass streaming skyline algorithms
Recommendations
Multi-Pass Streaming Algorithms for Monotone Submodular Function Maximization
AbstractWe consider maximizing a monotone submodular function under a cardinality constraint or a knapsack constraint in the streaming setting. In particular, the elements arrive sequentially and at any point of time, the algorithm has access to only a ...
Lower bounds for quantile estimation in random-order and multi-pass streaming
ICALP'07: Proceedings of the 34th international conference on Automata, Languages and ProgrammingWe present lower bounds on the space required to estimate the quantiles of a stream of numerical values. Quantile estimation is perhaps the most studied problem in the data stream model and it is relatively well understood in the basic single-pass data ...
Tight Space-Approximation Tradeoff for the Multi-Pass Streaming Set Cover Problem
PODS '17: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsWe study the classic set cover problem in the streaming model: the sets that comprise the instance are revealed one by one in a stream and the goal is to solve the problem by making one or few passes over the stream while maintaining a sublinear space o(...
Comments